## Machine learning exercise
### TJ Slezak

Design and create a machine learning model that classifies a name as either male or female based on the characteristics of the name. Write up a report with screenshots in a text document or Jupyter notebook that describes the following:

•	the input data and its quality

•	any data exploration and any interesting insights

•	any preprocessing or cleaning of the data that was needed, including how you chose to label the data (i.e., how you labeled a name male or female as the ground truth for the model)

•	the features used by your model and why you chose them

•	your selection of machine learning algorithm and why it was chosen

•	the accuracy of your model and any relevant metrics that describe model performance

•	any model parameter tuning used to improve performance

•	any relevant discussion, model interpretation, future steps, or further exploration needed

In addition, prepare a short slideshow presentation (~10 min.) of an overview of the machine learning exercise. You will present this to the analytics team.
There are no restrictions on the classification model that you use; however, be sure to keep in mind your ability to adequately explain and interpret the results.
Good luck!


In [28]:
import pandas as pd
import numpy as np
import hvplot.pandas
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

### Perform Preprocessing steps

In [29]:
df = pd.read_csv('./data/names_df.csv')
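df['Name'] = df.Name.str.strip('--')
# str.strip treats its argument as a set of characters, so strip('L00') would also
# drop trailing zeros from the counts; remove only the leading 'L' instead.
# np.int was removed from NumPy, so the builtin int is used for the cast.
df['Number'] = df.Number.str.lstrip('L').astype(int)
df['Year'] = df.Year.str.strip('Y').astype(int)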
idx = df[df.Year == 88].index[0]
df.at[idx, 'Year'] = 1888

In [30]:
df['Sex'] = df.Sex.apply(lambda x: 0 if x=='F' else 1)
df.drop(columns=['Year'], inplace=True)
groups = df.groupby(['Sex', 'Name'], as_index=False).sum()
group = df.groupby(['Name']).sum()
group.drop(columns='Sex', inplace=True)

### Frequency of names by sex

In [31]:
male = groups[groups.Sex==1].drop(columns=['Sex']).set_index('Name')
fem = groups[groups.Sex==0].drop(columns=['Sex']).set_index('Name')

for name in group.index:
    try:
        # Names with no male records raise KeyError and get a count of 0.
        group.loc[name, 'Male'] = int(male.at[name, 'Number'])
    except KeyError:
        group.loc[name, 'Male'] = 0
        
group['Female'] = group.Number - group.Male
group.head()

Name,Number,Male,Female
Aaban,107,107.0,0.0
Aabha,35,0.0,35.0
Aabid,10,10.0,0.0
Aabir,5,5.0,0.0
Aabriella,32,0.0,32.0


#### The code below shows the n-grams for Male names and their frequency.

In [22]:
male_vect = CountVectorizer(analyzer='char_wb', ngram_range=(2,8))
male_counts = male_vect.fit_transform(male.index)
male_counts = male_counts.sum(axis=0).A1
m_vocab = male_vect.get_feature_names_out().tolist()  # get_feature_names() was removed in newer scikit-learn
mdf = pd.Series(male_counts, index=m_vocab)
mdf.head()

 a        3682
 aa        157
 aab         3
 aaba        1
 aaban       1
dtype: int64

#### The code below shows the n-grams for Female names and their frequency.

In [34]:
fem_vect = CountVectorizer(analyzer='char_wb', ngram_range=(2,8))
fem_counts = fem_vect.fit_transform(fem.index)
fem_counts = fem_counts.sum(axis=0).A1
f_vocab = fem_vect.get_feature_names_out().tolist()  # get_feature_names() was removed in newer scikit-learn
fdf = pd.Series(fem_counts, index=f_vocab)
fdf.head()

 a        7604
 aa        257
 aab         2
 aabh        1
 aabha       1
dtype: int64

### Vocab for all n-grams

In [35]:
vocab = set(m_vocab) | set(f_vocab)
vocab = sorted(vocab)
print(vocab[:50])

[' a', ' aa', ' aab', ' aaba', ' aaban', ' aaban ', ' aabh', ' aabha', ' aabha ', ' aabi', ' aabid', ' aabid ', ' aabir', ' aabir ', ' aabr', ' aabri', ' aabrie', ' aabriel', ' aad', ' aada', ' aada ', ' aadam', ' aadam ', ' aadan', ' aadan ', ' aadar', ' aadars', ' aadarsh', ' aade', ' aaden', ' aaden ', ' aades', ' aadesh', ' aadesh ', ' aadh', ' aadha', ' aadhav', ' aadhav ', ' aadhava', ' aadhi', ' aadhi ', ' aadhir', ' aadhira', ' aadhv', ' aadhvi', ' aadhvik', ' aadhy', ' aadhya', ' aadhya ', ' aadhyan']
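
The leading and trailing spaces in the vocabulary above come from the `char_wb` analyzer, which pads each word with a space before cutting n-grams, so ` a` means "starts with a" and `n ` means "ends with n". The sketch below is a plain-Python approximation of that behaviour for a single name, not scikit-learn's own code. The function name `char_wb_ngrams` is made up for illustration, and the edge case of names shorter than `min_n` is ignored.

```python
def char_wb_ngrams(name, min_n=2, max_n=8):
    """Return character n-grams of one name, padded with a space on each side."""
    padded = f" {name} "
    grams = []
    for n in range(min_n, max_n + 1):
        if n > len(padded):
            break  # the padded name has no n-grams this long
        # Slide a window of width n across the padded name.
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


grams = char_wb_ngrams("aaban")
print(grams[:6])
```

The padded name itself (`' aaban '`) ends up in the list, which is why whole names appear as features once the upper bound of `ngram_range` is large enough.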


In [36]:
d = dict()
for v in vocab:
    # Series.get returns the default (0) when an n-gram never occurs for that sex.
    d[v] = [mdf.get(v, 0), fdf.get(v, 0)]

### Probability of sex by n-gram

In [32]:
p = pd.DataFrame.from_dict(d, orient='index', columns=['Male', 'Female'])
p['p_male'] = p.Male / (p.Male + p.Female)
p['p_female'] = p.Female / (p.Male + p.Female)
p.head()

n-gram,Male,Female,p_male,p_female
a,3682,7604,0.326245,0.673755
aa,157,257,0.379227,0.620773
aab,3,2,0.6,0.4
aaba,1,0,1.0,0.0
aaban,1,0,1.0,0.0
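
As a cross-check on the table above, the same shares can be computed without pandas. This is a minimal sketch using a few counts copied from the `mdf` and `fdf` outputs earlier in the notebook. `sex_share` is a hypothetical helper, and `Counter` is used only because it returns 0 for missing keys.

```python
from collections import Counter

# Counts copied from the mdf / fdf outputs shown earlier in this notebook.
male_counts = Counter({" a": 3682, " aa": 157, " aab": 3, " aaba": 1})
female_counts = Counter({" a": 7604, " aa": 257, " aab": 2, " aabh": 1})


def sex_share(ngram):
    """Return (p_male, p_female) for an n-gram, or None if it was never seen."""
    m = male_counts[ngram]    # Counter yields 0 for unseen n-grams
    f = female_counts[ngram]
    total = m + f
    if total == 0:
        return None
    return m / total, f / total


shares = {g: sex_share(g) for g in sorted(set(male_counts) | set(female_counts))}
for gram, share in shares.items():
    print(repr(gram), share)
```

N-grams seen only a handful of times (such as ` aaba`) get a share of exactly 0 or 1, so these raw ratios are best read alongside the counts.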


#### Compute priors (the probability a name will be Female or Male if selected randomly, regardless of features)

In [33]:
T = len(df.Sex)
M = df.Sex.sum()
F = T - M
priors = [F/T, M/T]
priors

[0.5914239620921043, 0.40857603790789565]
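
The priors also give a baseline to read later scores against: a model that always predicts the more common class would be right about 59% of the time on these rows. A minimal sketch of that idea on toy labels follows. `class_priors` is a hypothetical helper, and 0 = female, 1 = male follows the encoding used above.

```python
from collections import Counter


def class_priors(labels):
    """Return the relative frequency of each class label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


labels = [0, 0, 0, 1, 1]  # toy labels, not the real data
toy_priors = class_priors(labels)

# Accuracy of a classifier that always predicts the most frequent class.
baseline_accuracy = max(toy_priors.values())
print(toy_priors, baseline_accuracy)
```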

#### Find minimum and maximum length of names in dataset (bounds for n-gram selection)

In [39]:
df.Name.str.len().min(), df.Name.str.len().max()

(2, 15)

## Modeling

#### Imports

In [41]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.linear_model import SGDClassifier, LogisticRegressionCV
from sklearn.pipeline import Pipeline

#### Separate response and features for modeling

In [37]:
y = df.Sex
X = df.Name
y.head()

0    0
1    0
2    0
3    0
4    0
Name: Sex, dtype: int64

### Train-test split of 80/20. 

In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=33)

## Model 1 - Simple tokenizer.
Name features:
1. First letter (case preserved)
2. Last letter
3. Last two letters
4. The remaining middle letters (everything between the first letter and the last two)

In [47]:
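def tokenizer(name):
    """Split a name into the positional features listed above."""
    features = list()
    features.append(name[0])              # first letter, original case
    features.append(name[-1].lower())     # last letter
    features.append(name[-2:].lower())    # last two letters
    # Middle letters; split() returns [] when the slice is empty (short names).
    features.extend(name[1:-2].lower().split())
    return features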

In [48]:
name = 'Thomas'
tokenizer(name)

['T', 's', 'as', 'hom']

In [49]:
%%time
vect = CountVectorizer(tokenizer=tokenizer, lowercase=False)
X_train_counts = vect.fit_transform(X_train)

mnb = MultinomialNB(alpha=0.001).fit(X_train_counts, y_train)

X_test_counts = vect.transform(X_test)

cv_scores = cross_val_score(mnb, X_train_counts, y_train, cv=10)

CPU times: user 43.8 s, sys: 1.23 s, total: 45 s
Wall time: 11.5 s


In [50]:
cv_scores

array([0.8229831 , 0.8227493 , 0.82398976, 0.82451469, 0.82294298,
       0.82350802, 0.8243783 , 0.822917  , 0.82287154, 0.8234744 ])

In [51]:
score = mnb.score(X_test_counts, y_test)
score

0.8240109317725423

In [52]:
%%time
vect = CountVectorizer(tokenizer=tokenizer, lowercase=False)
X_train_counts = vect.fit_transform(X_train)

tfidf_transformer = TfidfTransformer()
X_train_counts = tfidf_transformer.fit_transform(X_train_counts)

mnb = MultinomialNB(alpha=0.001).fit(X_train_counts, y_train)

X_test_counts = vect.transform(X_test)
X_test_counts = tfidf_transformer.transform(X_test_counts)

cv_scores = cross_val_score(mnb, X_train_counts, y_train, cv=10)

CPU times: user 44.1 s, sys: 1.14 s, total: 45.3 s
Wall time: 11.6 s


In [54]:
cv_scores

array([0.82739943, 0.82681492, 0.8270747 , 0.8281387 , 0.82689822,
       0.827775  , 0.82690472, 0.82715151, 0.82653452, 0.82662432])

In [55]:
score = mnb.score(X_test_counts, y_test)
score

0.8270867917273915
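
For reference, this is roughly what the TF-IDF step does to a count matrix. The sketch assumes the `TfidfTransformer` defaults (raw counts as term frequency, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, then L2 row normalisation). It is a plain-Python illustration on a toy matrix, and `tfidf_rows` is a made-up name.

```python
import math


def tfidf_rows(count_rows, smooth_idf=True):
    """Apply IDF weighting and L2 normalisation to a dense list of count rows."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of rows in which each term occurs.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    extra = 1 if smooth_idf else 0  # without smoothing, every term must have df > 0
    idf = [math.log((n_docs + extra) / (df[j] + extra)) + 1.0 for j in range(n_terms)]
    out = []
    for row in count_rows:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        out.append([v / norm if norm else 0.0 for v in weighted])
    return out


counts = [[1, 1, 0],   # toy counts: rows are names, columns are n-grams
          [1, 0, 2]]
rows = tfidf_rows(counts)
print(rows)
```

N-grams that occur in every name (the first column here) are down-weighted relative to rarer ones, which may be why the TF-IDF variant scores slightly higher in the run above.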

In [None]:
def cap_last(name):
    nam = name[:-1]
    e = name[-1:].upper()
    return nam + e


In [29]:
%%time
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,8))
X_train_counts = cwb_vectorizer.fit_transform(X_train)

model_mnb = MultinomialNB(alpha=0.001).fit(X_train_counts, y_train)

X_test_counts = cwb_vectorizer.transform(X_test)

cv_scores = cross_val_score(model_mnb, X_train_counts, y_train, cv=10)

CPU times: user 1min 46s, sys: 8.83 s, total: 1min 54s
Wall time: 1min 8s


In [30]:
score = model_mnb.score(X_test_counts, y_test)

score, np.mean(cv_scores), np.median(cv_scores)

(0.8927969282966126, 0.8930034567635445, 0.8929422690991278)

In [32]:
%%time
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,10))
X_train_counts = cwb_vectorizer.fit_transform(X_train)

model_mnb = MultinomialNB(alpha=0.001).fit(X_train_counts, y_train)

X_test_counts = cwb_vectorizer.transform(X_test)

cv_scores2 = cross_val_score(model_mnb, X_train_counts, y_train, cv=10)

CPU times: user 1min 48s, sys: 9.09 s, total: 1min 57s
Wall time: 1min 11s


In [33]:
score2 = model_mnb.score(X_test_counts, y_test)

score2, np.mean(cv_scores2), np.median(cv_scores2)

(0.8931840086456604, 0.8932658399811044, 0.8932312808089731)

In [34]:
%%time
mnb_pipe = Pipeline([
    ('vect', CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,10))),
    ('clf', MultinomialNB(alpha=0.001)),
])

mnb_pipe.fit(X_train, y_train)

CPU times: user 43.3 s, sys: 1.58 s, total: 44.8 s
Wall time: 41.1 s


In [37]:
score = mnb_pipe.score(X_test, y_test)  # score the pipeline fitted in the previous cell
score

0.8931840086456604

In [48]:
mnb_pipe2 = Pipeline([
    ('vect', CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,10))),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB(alpha=0.001)),
])

mnb_pipe2.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='char_wb', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(2, 10), preprocessor=None, stop_words=None,
        ...ear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=0.001, class_prior=None, fit_prior=True))])

In [49]:
mnb_pipe2.score(X_test, y_test)

0.8951427910831236

In [40]:
%%time
logrcv_pipe = Pipeline([
    ('vect', CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,10))),
    ('logrCV', LogisticRegressionCV(cv=5, random_state=11, fit_intercept=False, max_iter=5000, n_jobs=-1)),
])

logrcv_pipe.fit(X_train, y_train)

CPU times: user 49min 41s, sys: 4.19 s, total: 49min 45s
Wall time: 47min 37s


In [41]:
logrcv_pipe.score(X_test, y_test)

0.9006164709183156

In [42]:
%%time
logrcv_pipe2 = Pipeline([
    ('vect', CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,10))),
    ('tfidf', TfidfTransformer()),
    ('logrCV', LogisticRegressionCV(cv=5, random_state=11, fit_intercept=False, max_iter=5000, n_jobs=-1)),
])

logrcv_pipe2.fit(X_train, y_train)

CPU times: user 12min 16s, sys: 6.79 s, total: 12min 23s
Wall time: 17min 35s


In [43]:
logrcv_pipe2.score(X_test, y_test)

0.9006970044137551

In [44]:
%%time
logrcv_pipe3 = Pipeline([
    ('vect', CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,10))),
    ('tfidf', TfidfTransformer()),
    ('logrCV', LogisticRegressionCV(cv=5, random_state=11, fit_intercept=True, max_iter=5000, n_jobs=-1)),
])

logrcv_pipe3.fit(X_train, y_train)

CPU times: user 31min 57s, sys: 4.65 s, total: 32min 1s
Wall time: 45min 13s


In [45]:
logrcv_pipe3.score(X_test, y_test)

0.9006736237215308

In [49]:
%%time
logrcv_pipe4 = Pipeline([
    ('vect', CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,10))),
    ('tfidf', TfidfTransformer()),
    ('logrCV', LogisticRegressionCV(cv=10, random_state=11, fit_intercept=False, max_iter=5000, n_jobs=-1)),
])

logrcv_pipe4.fit(X_train, y_train)

CPU times: user 10min 59s, sys: 6.77 s, total: 11min 6s
Wall time: 31min 10s


In [50]:
logrcv_pipe4.score(X_test, y_test)

0.9006866129949888

In [51]:
%%time
logrcv_pipe5 = Pipeline([
    ('vect', CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,12))),
    ('tfidf', TfidfTransformer()),
    ('logrCV', LogisticRegressionCV(cv=5, random_state=11, fit_intercept=False, max_iter=5000, n_jobs=-1)),
])

logrcv_pipe5.fit(X_train, y_train)

CPU times: user 13min 9s, sys: 6.76 s, total: 13min 16s
Wall time: 19min 37s


In [52]:
logrcv_pipe5.score(X_test, y_test)

0.9006840151402972

In [53]:
%%time
logrcv_pipe6 = Pipeline([
    ('vect', CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,15))),
    ('tfidf', TfidfTransformer()),
    ('logrCV', LogisticRegressionCV(cv=5, random_state=11, fit_intercept=False, max_iter=5000, n_jobs=-1)),
])

logrcv_pipe6.fit(X_train, y_train)

CPU times: user 11min 47s, sys: 5.89 s, total: 11min 53s
Wall time: 19min 34s


In [54]:
logrcv_pipe6.score(X_test, y_test)

0.9006918087043719

In [45]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2,10))
X_train_counts = cwb_vectorizer.fit_transform(X_train)

model_mnb = MultinomialNB().fit(X_train_counts, y_train)  # multinomial NB with default alpha

X_test_counts = cwb_vectorizer.transform(X_test)

model_mnb.score(X_test_counts, y_test)

0.8877882644512162

## Simple Model - Character Frequency

In [43]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB, GaussianNB, ComplementNB
from sklearn.linear_model import SGDClassifier, LogisticRegression, LogisticRegressionCV
from sklearn.pipeline import Pipeline

In [42]:
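# The cells below index X_train.Name, so split a one-column DataFrame rather than the Series X.
X_train, X_test, y_train, y_test = train_test_split(df[['Name']], y, test_size=0.20, random_state=33)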

In [50]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(1,1))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

model_mnb = MultinomialNB().fit(X_train_counts, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)

model_mnb.score(X_test_counts, y_test)

0.6690333122907103

In [65]:
X_train_counts

<1539732x763325 sparse matrix of type '<class 'numpy.int64'>'
	with 46601657 stored elements in Compressed Sparse Row format>

In [53]:
print(cwb_vectorizer.get_feature_names_out().tolist())

[' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [54]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2,2))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

model_mnb = MultinomialNB().fit(X_train_counts, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)

model_mnb.score(X_test_counts, y_test)

0.7864589422055267

In [59]:
print(cwb_vectorizer.get_feature_names_out()[:100].tolist())

[' a', ' b', ' c', ' d', ' e', ' f', ' g', ' h', ' i', ' j', ' k', ' l', ' m', ' n', ' o', ' p', ' q', ' r', ' s', ' t', ' u', ' v', ' w', ' x', ' y', ' z', 'a ', 'aa', 'ab', 'ac', 'ad', 'ae', 'af', 'ag', 'ah', 'ai', 'aj', 'ak', 'al', 'am', 'an', 'ao', 'ap', 'aq', 'ar', 'as', 'at', 'au', 'av', 'aw', 'ax', 'ay', 'az', 'b ', 'ba', 'bb', 'bc', 'bd', 'be', 'bg', 'bh', 'bi', 'bj', 'bl', 'bm', 'bn', 'bo', 'br', 'bs', 'bt', 'bu', 'bw', 'by', 'c ', 'ca', 'cb', 'cc', 'cd', 'ce', 'cg', 'ch', 'ci', 'cj', 'ck', 'cl', 'cm', 'cn', 'co', 'cp', 'cq', 'cr', 'cs', 'ct', 'cu', 'cx', 'cy', 'cz', 'd ', 'da', 'db']


In [60]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2,8))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

model_mnb = MultinomialNB().fit(X_train_counts, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)

model_mnb.score(X_test_counts, y_test)

0.8870738544110274

In [61]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,8))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

model_mnb = MultinomialNB().fit(X_train_counts, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)

model_mnb.score(X_test_counts, y_test)

0.8887624599605646

In [62]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,8))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

model_mnb = MultinomialNB(alpha=0.001).fit(X_train_counts, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)

model_mnb.score(X_test_counts, y_test)

0.8927969282966126

In [63]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,10))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

model_mnb = MultinomialNB(alpha=0.001).fit(X_train_counts, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)

model_mnb.score(X_test_counts, y_test)

0.8931840086456604

In [186]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(1,2))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer(use_idf=False)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)

0.7537207773820379

In [184]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(1,3))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer(use_idf=False)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)

0.8053063779930534

In [197]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(1,4))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)
#0.8391668160433114

0.8663975289206174

In [215]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,3))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)

0.8400007273993136

In [199]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,4))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)
#0.843866335180408

0.8671223303795725

In [219]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,5))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)

0.8816027724305269

In [220]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,6))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)

0.8885884036962276

In [227]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,7))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)

0.8915525558993384

In [234]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,8))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)
#0.8925449363915279

0.8925449363915279

In [235]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,8))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB(alpha=0.01).fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)
#0.8925449363915279  alpha = 1.0
#0.8963300106771828  alpha = 0.01

0.8963300106771828

In [236]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,8))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB(alpha=0.001).fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)
#0.8925449363915279  alpha = 1.0
#0.896091008045556   alpha = 0.1
#0.8963300106771828  alpha = 0.01
#0.896348195660024   alpha = 0.001
#0.8962494771817433   alpha = 0.00001

0.896348195660024
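
The `alpha` values in the comments above control additive smoothing in multinomial naive Bayes: each n-gram's class-conditional probability is estimated as `(count + alpha) / (class_total + alpha * n_features)`. The sketch below uses hypothetical counts for three n-grams within one class to show how a smaller `alpha` pushes the log-probability of an unseen n-gram much lower. `smoothed_log_probs` is an illustrative helper, not part of scikit-learn.

```python
import math


def smoothed_log_probs(feature_counts, alpha):
    """Log P(feature | class) with additive (Lidstone) smoothing."""
    total = sum(feature_counts) + alpha * len(feature_counts)
    return [math.log((c + alpha) / total) for c in feature_counts]


# Hypothetical counts of three n-grams within one class; the last was never seen.
toy_counts = [40, 10, 0]
for alpha in (1.0, 0.01, 0.001):
    print(alpha, [round(v, 3) for v in smoothed_log_probs(toy_counts, alpha)])
```

With a vocabulary of hundreds of thousands of mostly rare n-grams, `alpha=1.0` adds a large amount of pseudo-count mass, which is one plausible reason the smaller values score slightly better here.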

In [257]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,14))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

#tfidf_transformer = TfidfTransformer()
#X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB(alpha=0.001).fit(X_train_counts, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
#X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_counts, y_test)
#0.8925449363915279  alpha = 1.0
#0.896091008045556   alpha = 0.1
#0.8963300106771828  alpha = 0.01
#0.896348195660024   alpha = 0.001
#0.8965768068728843  ngram_range=(2,14)

0.89362304608854

In [274]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,8))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer(smooth_idf=False)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB(alpha=0.001).fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)
#0.8925449363915279  alpha = 1.0
#0.896091008045556   alpha = 0.1
#0.8963300106771828  alpha = 0.01
#0.896348195660024   alpha = 0.001
#0.8965768068728843  ngram_range=(2,14)
#0.893591871832241   No tfidf
###0.8973197933146807 ngram_range=(2,10)
#0.8965794047275759 ngram_range=(2,12), smooth_idf=False
#0.8965222519243609 ngram_range=(2,10), smooth_idf=False

0.8963559892240988

In [277]:
model_mnb.class_count_

array([910303., 629429.])

In [278]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,10))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB(alpha=0.001).fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)
#0.8925449363915279  alpha = 1.0
#0.896091008045556   alpha = 0.1
#0.8963300106771828  alpha = 0.01
#0.896348195660024   alpha = 0.001
#0.896296238566192   alpha = 0.0001
#0.8962494771817433  alpha = 0.00001
#0.8964936755227533  ngram_range=(2,9)

0.8965170562149777

In [237]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,8))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB(alpha=0.1).fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)
#0.8925449363915279  alpha = 1.0
#0.8963300106771828  alpha = 0.01

0.896091008045556

In [233]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,9))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)

0.8928696682279773

In [230]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,10))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)

0.8930151480907067

In [232]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,15))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)

0.8930567137657722

In [279]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,8))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

model_logr = LogisticRegressionCV(cv=5, random_state=11, max_iter=200, n_jobs=-1)
model_logr.fit(X_train_counts, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)

model_logr.score(X_test_counts, y_test)
# 0.89368799245583 << cv=3
# 0.8934411962601284 << cv=5
# 0.9005385352775678 << counts, ngram_range=(2,8)



0.9005385352775678

In [212]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(1,1))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_logr2 = LogisticRegressionCV(cv=5, random_state=11, max_iter=200, n_jobs=-1)
model_logr2.fit(X_train_tfidf, y_train)
X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_logr2.score(X_test_tfidf, y_test)

0.7502084778390006

In [214]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(1,2))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_logr2 = LogisticRegressionCV(cv=5, random_state=11, max_iter=200, n_jobs=-1)
model_logr2.fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_logr2.score(X_test_tfidf, y_test)

0.8241330309430472

In [None]:
# parameters = {
#     'vect__max_df': (0.5, 0.75, 1.0),
#     # 'vect__max_features': (None, 5000, 10000, 50000),
#     'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
#     # 'tfidf__use_idf': (True, False),
#     # 'tfidf__norm': ('l1', 'l2'),
#     'clf__max_iter': (5,),
#     'clf__alpha': (0.00001, 0.000001),
#     'clf__penalty': ('l2', 'elasticnet'),
#     # 'clf__max_iter': (10, 50, 80),
# }
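
The commented-out dictionary above uses the `step__parameter` naming that scikit-learn's grid search expects for pipelines. To get a feel for how many fits such a grid implies before running it, the combinations can be enumerated by hand. This is a small sketch with the standard library only. `expand_grid` is a hypothetical helper, and the values are the uncommented entries of the dictionary.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per combination of the listed parameter values."""
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))


parameters = {
    "vect__max_df": (0.5, 0.75, 1.0),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "clf__max_iter": (5,),
    "clf__alpha": (0.00001, 0.000001),
    "clf__penalty": ("l2", "elasticnet"),
}

combos = list(expand_grid(parameters))
print(len(combos), "candidate settings; multiply by the number of CV folds for total fits")
```

Given that single fits in this notebook take from under a minute to most of an hour, counting the candidates first helps decide which parameters are worth searching.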

In [54]:
%%time
mnb_pipe = Pipeline([
    ('vect', CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,8))),
    ('clf', MultinomialNB(alpha=0.001)),
])

mnb_pipe.fit(X_train.Name, y_train)

CPU times: user 37.4 s, sys: 1.41 s, total: 38.8 s
Wall time: 37.2 s


In [55]:
mnb_pipe.score(X_test.Name, y_test)

0.8927969282966126

In [None]:
%%time
mnb_pipe2 = Pipeline([
    ('vect', CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,8))),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB(alpha=0.001)),
])

mnb_pipe2.fit(X_train.Name, y_train)

In [None]:
mnb_pipe2.score(X_test.Name, y_test)

In [56]:
%%time
cnb_pipe = Pipeline([
    ('vect', CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,8))),
    ('clf', ComplementNB(alpha=0.001)),
])

cnb_pipe.fit(X_train.Name, y_train)

CPU times: user 38.7 s, sys: 1.45 s, total: 40.2 s
Wall time: 38 s


In [57]:
cnb_pipe.score(X_test.Name, y_test)

0.8926384591604253

In [58]:
%%time
sgd_pipe = Pipeline([
    ('vect', CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,8))),
    ('clf', SGDClassifier(loss='log_loss', penalty='l2',  # 'log' was renamed 'log_loss' in newer scikit-learn
                          alpha=0.001, random_state=11,
                          max_iter=100, tol=1e-3)),
])

sgd_pipe.fit(X_train.Name, y_train)

CPU times: user 46.2 s, sys: 1.52 s, total: 47.7 s
Wall time: 45.1 s


In [59]:
sgd_pipe.score(X_test.Name, y_test)

0.8472513398435572

In [60]:
%%time
logrcv_pipe = Pipeline([
    ('vect', CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,8))),
    ('logrCV', LogisticRegressionCV(cv=5, random_state=11, max_iter=1000, n_jobs=-1)),
])

logrcv_pipe.fit(X_train.Name, y_train)

CPU times: user 17min 49s, sys: 3.12 s, total: 17min 52s
Wall time: 52min 20s


In [61]:
logrcv_pipe.score(X_test.Name, y_test)

0.9006684280121475
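
`score` reports accuracy only. Because the classes are not balanced (roughly 59% female rows), per-class precision and recall are worth reporting next to it. The sketch below computes them from true and predicted labels on toy data. `binary_metrics` is a hypothetical helper, and 1 = male is treated as the positive class, following the encoding used in preprocessing.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix counts plus accuracy, precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


toy_true = [0, 0, 0, 1, 1, 1]   # toy labels, not the notebook's test set
toy_pred = [0, 0, 1, 1, 1, 0]
metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```

On the real data the inputs would be `y_test` and the pipeline's predictions on the test names.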

In [None]:
%%time
logrcv_pipe2 = Pipeline([
    ('vect', CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2,8))),
    ('logrCV', LogisticRegressionCV(cv=5, solver='sag', random_state=11, max_iter=1000, n_jobs=-1)),
])

logrcv_pipe2.fit(X_train.Name, y_train)

In [None]:
logrcv_pipe2.score(X_test.Name, y_test)

In [226]:
import numpy as np
from sklearn.model_selection import KFold

#X = ["a", "b", "c", "d"]
kf = KFold(2)
for train, test in kf.split(X):
    print("%s %s" % (train, test))


[ 962333  962334  962335 ... 1924662 1924663 1924664] [     0      1      2 ... 962330 962331 962332]
[     0      1      2 ... 962330 962331 962332] [ 962333  962334  962335 ... 1924662 1924663 1924664]
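
The output above shows that `KFold` without shuffling cuts the rows into contiguous blocks, with the first fold one row larger when the row count does not divide evenly. A minimal pure-Python sketch of that splitting rule follows. `kfold_indices` is an illustrative name, and this is an approximation of the default behaviour, not scikit-learn's code.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


for train, test in kfold_indices(5, 2):
    print(train, test)
```

Contiguous folds are only representative if the rows are in no particular order, so on a frame sorted by sex or year it is safer to shuffle before splitting.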


In [211]:
X_test_counts

<384933x83398 sparse matrix of type '<class 'numpy.int64'>'
	with 7124730 stored elements in Compressed Sparse Row format>

In [188]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(3,4))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer(use_idf=False)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)

0.848550267189355

In [175]:
model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)
tfidf_transformer.get_params()

{'norm': 'l2', 'smooth_idf': True, 'sublinear_tf': False, 'use_idf': True}

In [164]:
X_train_counts

<1539732x27 sparse matrix of type '<class 'numpy.int64'>'
	with 9767778 stored elements in Compressed Sparse Row format>

In [149]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(1, 1))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

predicted = model_mnb.predict(X_test_tfidf)

In [150]:
model_mnb.score(X_test_tfidf, y_test)

0.6323931697204449

In [132]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(1, 2))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)

0.7593269478065013

In [151]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(1, 2))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)

0.8024929013620552

In [152]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(2, 2))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)

0.8040516141770127

In [153]:
cwb_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2))
X_train_counts = cwb_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = cwb_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)

0.7668970963778112
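
To see what the `analyzer` and `lowercase` settings actually change, it helps to look at the n-grams extracted from a single name. A minimal sketch using `CountVectorizer.build_analyzer()`; the helper `bigrams` is a hypothetical name introduced here:

```python
from sklearn.feature_extraction.text import CountVectorizer


def bigrams(name, analyzer, lowercase):
    """List the character bigrams CountVectorizer would extract from `name`."""
    vec = CountVectorizer(analyzer=analyzer, lowercase=lowercase, ngram_range=(2, 2))
    return vec.build_analyzer()(name)


print(bigrams("Anna", "char", True))      # ['an', 'nn', 'na']
print(bigrams("Anna", "char", False))     # ['An', 'nn', 'na']
print(bigrams("Anna", "char_wb", True))   # [' a', 'an', 'nn', 'na', 'a ']
print(bigrams("Anna", "char_wb", False))  # [' A', 'An', 'nn', 'na', 'a ']
```

With `char_wb` the padding spaces mark the first and last letters of the name, and with `lowercase=False` the capital letter marks the start. One plausible reading of the scores in this section is that these position cues (especially the final letter) carry much of the signal, although the notebook does not test that directly.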

In [154]:
char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 1))
X_train_counts = char_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = char_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)

0.6359184585369402

In [155]:
char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 2))
X_train_counts = char_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = char_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)

0.704418691044935

In [156]:
char_vectorizer = CountVectorizer(analyzer='char', lowercase=False, ngram_range=(1, 2))
X_train_counts = char_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = char_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)

0.7977596101139679

In [136]:
char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2))
X_train_counts = char_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = char_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)

0.7042212540883738

In [157]:
char_vectorizer = CountVectorizer(analyzer='char', lowercase=False, ngram_range=(2, 2))
X_train_counts = char_vectorizer.fit_transform(X_train.Name)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

model_mnb = MultinomialNB().fit(X_train_tfidf, y_train)

X_test_counts = char_vectorizer.transform(X_test.Name)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

model_mnb.score(X_test_tfidf, y_test)

0.7984922051369978
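
The cells above repeat the same vectorize, tf-idf, Naive Bayes steps with only the vectorizer settings changed. A sketch of how that sweep could be collapsed into one loop with a `Pipeline`; `score_config` and the toy name lists are hypothetical and stand in for the notebook's `X_train.Name`, `y_train`, `X_test.Name`, and `y_test`:

```python
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def score_config(X_tr, y_tr, X_te, y_te, analyzer, lowercase, ngram_range):
    """Fit vectorizer -> tf-idf -> MultinomialNB and return test accuracy."""
    pipe = Pipeline([
        ("vect", CountVectorizer(analyzer=analyzer, lowercase=lowercase,
                                 ngram_range=ngram_range)),
        ("tfidf", TfidfTransformer()),
        ("mnb", MultinomialNB()),
    ])
    pipe.fit(X_tr, y_tr)
    return pipe.score(X_te, y_te)


# Toy data for illustration only (assumption): 0 = female, 1 = male
train_names = ["Anna", "Maria", "Julia", "Emma", "John", "Robert", "David", "Mark"]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
test_names = ["Olivia", "Sophia", "Peter", "Frank"]
test_labels = [0, 0, 1, 1]

# Every combination of the settings explored in the cells above
configs = list(product(("char", "char_wb"), (True, False), ((1, 1), (1, 2), (2, 2))))

results = {
    cfg: score_config(train_names, train_labels, test_names, test_labels, *cfg)
    for cfg in configs
}

# Print configurations from best to worst test accuracy
for cfg, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(cfg, round(acc, 3))
```

Collecting the scores in one dictionary makes the comparison easier to read than scattered cell outputs, and it avoids stale variables when cells are run out of order. For model selection proper, the comparison would be better done with cross-validation on the training split than on the test set.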