## Exercise 01: Dealing with categories

Load the Adult dataset from `data/adult.csv.zip` and build a machine learning model to estimate the target `income`. This dataset contains not just numerical variables but also categorical ones, so don't forget to preprocess these variables as well before training the model. Choose the `LogisticRegression` model from `scikit-learn` for training.

*Note: This dataset is for classification, so feel free to experiment with any model you want from the classification `scikit-learn` catalog*

Here is the documentation of the dataset

```
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
class: >50K, <=50K
```

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

Read the data

In [2]:
data = pd.read_csv("data/adult.csv.zip")
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


Let's arrange data in X and y

In [4]:
X = data.drop(columns=["income"])
y = data["income"]

In [5]:
X.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States


In [6]:
y.head()

0    <=50K
1    <=50K
2     >50K
3     >50K
4    <=50K
Name: income, dtype: object

Let's split data into train and test

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=99)

In [8]:
X_train.shape

(34189, 14)

In [9]:
y_train.shape

(34189,)

In [10]:
X_test.shape

(14653, 14)

Let's preprocess categorical columns

In [11]:
X_train_cat = X_train.select_dtypes("O")

In [12]:
X_train_cat.head()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,gender,native-country
5765,Self-emp-inc,Prof-school,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States
2336,Private,Assoc-voc,Married-civ-spouse,Sales,Husband,White,Male,United-States
22156,Self-emp-not-inc,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,United-States
38574,Self-emp-not-inc,Bachelors,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States
43755,Private,Assoc-acdm,Married-civ-spouse,Craft-repair,Husband,White,Male,United-States


In [13]:
ohe = OneHotEncoder(sparse_output=False)

In [14]:
cat_data_ohe = ohe.fit_transform(X_train_cat)

In [15]:
cat_data_ohe

array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.]])

In [16]:
cat_data_ohe.shape

(34189, 102)

In [17]:
cat_data_ohe = pd.DataFrame(cat_data_ohe, columns=ohe.get_feature_names_out())

In [18]:
cat_data_ohe.head()

Unnamed: 0,workclass_?,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,workclass_Without-pay,education_10th,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [19]:
X_train_full = pd.concat([X_train.reset_index(drop=True), cat_data_ohe], axis=1)
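
The `reset_index(drop=True)` call matters because `pd.concat` aligns rows by index label, and `X_train` still carries the shuffled labels from the split while `cat_data_ohe` has a fresh `0..n-1` index. A small sketch with toy frames (hypothetical values) shows what happens without it:

```python
import pandas as pd

# Rows keep their original (shuffled) labels after a split, e.g. 5765, 2336
left = pd.DataFrame({"age": [59, 38]}, index=[5765, 2336])
# A frame built from a NumPy array gets a fresh 0..n-1 index
right = pd.DataFrame({"workclass_Private": [0.0, 1.0]})

# Without resetting, rows are matched by label and padded with NaN
misaligned = pd.concat([left, right], axis=1)
# After resetting, rows are matched by position
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

An alternative that keeps the original labels would be to build the one-hot frame with `index=X_train.index`.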

In [20]:
X_train_full.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,59,Self-emp-inc,36085,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,White,Male,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,38,Private,189922,Assoc-voc,11,Married-civ-spouse,Sales,Husband,White,Male,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,41,Self-emp-not-inc,120539,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,45,Self-emp-not-inc,28497,Bachelors,13,Married-civ-spouse,Farming-fishing,Husband,White,Male,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,30,Private,108386,Assoc-acdm,12,Married-civ-spouse,Craft-repair,Husband,White,Male,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


Remove original categorical columns

In [21]:
X_train_full = X_train_full.drop(columns=X_train_cat.columns)

In [22]:
X_train_full.head()

Unnamed: 0,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week,workclass_?,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,59,36085,15,15024,0,60,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,38,189922,11,0,0,50,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,41,120539,10,3103,0,40,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,45,28497,13,0,1485,70,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,30,108386,12,0,0,40,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [23]:
X_train_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34189 entries, 0 to 34188
Columns: 108 entries, age to native-country_Yugoslavia
dtypes: float64(102), int64(6)
memory usage: 28.2 MB


Build a `LogisticRegression` model

In [24]:
lr = LogisticRegression()

In [25]:
lr.fit(X_train_full, y_train)

In [26]:
X_test_cat = X_test.select_dtypes("O")

In [27]:
X_test_cat.head()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,gender,native-country
20515,Private,Some-college,Widowed,Exec-managerial,Unmarried,White,Female,United-States
356,Private,Bachelors,Never-married,Sales,Not-in-family,White,Male,United-States
7772,Private,Some-college,Never-married,Adm-clerical,Not-in-family,White,Female,United-States
34450,State-gov,HS-grad,Married-civ-spouse,Adm-clerical,Wife,White,Female,United-States
19643,Private,10th,Never-married,Other-service,Own-child,Black,Female,United-States


In [28]:
X_test_ohe = ohe.transform(X_test_cat)

In [29]:
X_test_ohe = pd.DataFrame(X_test_ohe, columns=ohe.get_feature_names_out())

In [30]:
X_test_ohe.head()

Unnamed: 0,workclass_?,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,workclass_Without-pay,education_10th,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [31]:
X_test_full = pd.concat([X_test.reset_index(drop=True), X_test_ohe], axis=1)

In [32]:
X_test_full.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,31,Private,73796,Some-college,10,Widowed,Exec-managerial,Unmarried,White,Female,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,33,Private,90409,Bachelors,13,Never-married,Sales,Not-in-family,White,Male,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,21,Private,29810,Some-college,10,Never-married,Adm-clerical,Not-in-family,White,Female,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,41,State-gov,176663,HS-grad,9,Married-civ-spouse,Adm-clerical,Wife,White,Female,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,29,Private,136277,10th,6,Never-married,Other-service,Own-child,Black,Female,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [33]:
X_test_full = X_test_full.drop(columns=X_test_cat.columns)

In [34]:
X_test_full.head()

Unnamed: 0,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week,workclass_?,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,31,73796,10,0,0,30,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,33,90409,13,0,0,45,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,21,29810,10,0,0,40,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,41,176663,9,0,0,40,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,29,136277,6,0,0,32,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [35]:
lr.predict(X_test_full)

array(['<=50K', '<=50K', '<=50K', ..., '<=50K', '>50K', '<=50K'],
      dtype=object)

In [36]:
lr.score(X_test_full, y_test)

0.8004504197092746
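
The manual steps above (encode, concatenate, drop, repeat for test) can also be bundled into a single estimator. The following is only a sketch of one alternative, not the approach used in this notebook: it assumes a toy frame with made-up values, explicit column lists, and adds a `StandardScaler` for the numeric columns, which the solution above does not use.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the Adult data (hypothetical values)
toy_X = pd.DataFrame({
    "age": [25, 38, 28, 44, 18, 52, 31, 47],
    "hours-per-week": [40, 50, 40, 40, 30, 60, 35, 45],
    "workclass": ["Private", "Private", "Local-gov", "Private",
                  "?", "Self-emp-inc", "Private", "Local-gov"],
})
toy_y = pd.Series(["<=50K", "<=50K", ">50K", ">50K",
                   "<=50K", ">50K", "<=50K", ">50K"])

cat_cols = ["workclass"]
num_cols = ["age", "hours-per-week"]

# Encode categorical columns and scale numeric ones in one step
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("num", StandardScaler(), num_cols),
])
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
model.fit(toy_X, toy_y)

# The same fitted preprocessing is applied automatically at predict time
new_row = pd.DataFrame({"age": [30], "hours-per-week": [40], "workclass": ["Never-worked"]})
preds = model.predict(toy_X)
new_pred = model.predict(new_row)
print(new_pred)
```

With this layout, `model.fit(X_train, y_train)` and `model.score(X_test, y_test)` would replace the separate train and test preprocessing cells.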

## Exercise 02: Dealing with text

Load the **20 newsgroups** dataset from `scikit-learn` with the code below.
1. Build a classification model (`LogisticRegression`) on the training set
2. Load the "test" set and use your model to `predict` the category of the new texts
3. Calculate the `accuracy` of the model on the test set

*Note: this is a text dataset, so first use the tools you have available to process the text before training a model*

In [4]:
import pandas as pd

In [1]:
from sklearn.datasets import fetch_20newsgroups

In [2]:
# first load the dataset from sklearn package
data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
text = data["data"]
target = data["target"]
target_names = dict(enumerate(data["target_names"]))

In [5]:
# prepare data in a DataFrame

data = pd.DataFrame({
    "text": text,
    "target": target
})

data.target = data.target.replace(target_names)

In [6]:
data.head()

Unnamed: 0,text,target
0,I was wondering if anyone out there could enli...,rec.autos
1,A fair number of brave souls who upgraded thei...,comp.sys.mac.hardware
2,"well folks, my mac plus finally gave up the gh...",comp.sys.mac.hardware
3,\nDo you have Weitek's address/phone number? ...,comp.graphics
4,"From article <C5owCB.n3p@world.std.com>, by to...",sci.space


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11314 entries, 0 to 11313
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    11314 non-null  object
 1   target  11314 non-null  object
dtypes: object(2)
memory usage: 176.9+ KB


In [8]:
# to print the text of one particular sample

print(data.iloc[1].text)

A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requested.

I will be summarizing in the next two days, so please add to the network
knowledge base if you have done the clock upgrade and haven't answered this
poll. Thanks.


## SOLUTION

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.linear_model import LogisticRegression

We have a dataframe with a single text column. In order to build an ML model, we first need to transform the text into vectors. To do so, we're using the *TF-IDF* technique. To illustrate the power of this functionality in `scikit-learn`, we're solving this exercise in two ways:
1. Using plain `TfidfVectorizer`, without parameters.
2. Using the additional capabilities `TfidfVectorizer` offers to build NLP models.

But first, arrange $X$ and $y$, and split data into train and test

In [9]:
X = data[["text"]]
y = data["target"]

In [10]:
X.head()

Unnamed: 0,text
0,I was wondering if anyone out there could enli...
1,A fair number of brave souls who upgraded thei...
2,"well folks, my mac plus finally gave up the gh..."
3,\nDo you have Weitek's address/phone number? ...
4,"From article <C5owCB.n3p@world.std.com>, by to..."


In [11]:
y.head()

0                rec.autos
1    comp.sys.mac.hardware
2    comp.sys.mac.hardware
3            comp.graphics
4                sci.space
Name: target, dtype: object

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [15]:
print(f"Train: {X_train.shape}")
print(f"Test: {X_test.shape}")

Train: (7919, 1)
Test: (3395, 1)
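
Note that this split does not set `random_state`, so every re-run of the cell produces a different partition, and therefore different vocabularies and scores further down. A minimal sketch, using placeholder documents and the seed value `99` from Exercise 01, of how fixing the seed makes the split repeatable:

```python
from sklearn.model_selection import train_test_split

# Placeholder documents and labels, only to illustrate the split
docs = [f"document {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

split_a = train_test_split(docs, labels, test_size=0.3, random_state=99)
split_b = train_test_split(docs, labels, test_size=0.3, random_state=99)

# Same seed, same partition: train and test parts are identical
same_split = split_a[0] == split_b[0] and split_a[1] == split_b[1]
print(same_split, len(split_a[0]), len(split_a[1]))
```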


### **Simple `TfidfVectorizer`**

Let's instantiate `TfidfVectorizer` with its default parameters

In [16]:
tfidf = TfidfVectorizer()

Compute the transformed `X_train`

In [17]:
X_train_tr = tfidf.fit_transform(X_train.text)  # In this case, the "fit_transform" receives a Series with all documents

In [23]:
X_train_tr.toarray().sum()

55533.627157137904

In [18]:
X_train_tr.shape

(7919, 78933)

We see that there are 7919 documents and 78933 columns! (Every column corresponds to a distinct token from the training corpus.)
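
To make the contents of this matrix concrete, here is a sketch that recomputes the TF-IDF weights by hand on a three-sentence toy corpus (made up for illustration) and compares them with `TfidfVectorizer`. It assumes the default settings, under which scikit-learn documents the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization of each row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the clock upgrade works",
    "the clock is fast",
    "floppy disk works",
]

# Raw term counts: one row per document, one column per vocabulary word
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each word

# Smoothed inverse document frequency (default smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit L2 norm (default norm="l2")
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

auto = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, auto))
```

Words that appear in many documents (such as "the") get a lower `idf` and therefore a smaller weight than words concentrated in a few documents.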

Now, let's train a `LogisticRegression` model

In [24]:
lr = LogisticRegression()

In [25]:
lr.fit(X_train_tr, y_train)  # we can use directly the sparse matrix "X_train_tr" in the LR model

Let's evaluate the results on **train**

In [26]:
lr.score(X_train_tr, y_train)

0.914888243465084

Let's evaluate the results on **test**

In [27]:
# First, transform test data into numbers with Tf-Idf
X_test_tr = tfidf.transform(X_test.text) # Be careful! Here we use "transform", not "fit_transform"

In [28]:
lr.score(X_test_tr, y_test)

0.7184094256259205
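
The reason for calling `transform` rather than `fit_transform` on the test set is that the vocabulary and idf weights must come from the training data only. A small sketch with made-up sentences shows that words unseen during `fit` are simply dropped and the number of columns stays the same:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the car engine is loud", "the space shuttle launched"]
test_docs = ["the rocket engine exploded"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns the vocabulary
test_matrix = vec.transform(test_docs)        # reuses it, learns nothing new

print(train_matrix.shape, test_matrix.shape)
print("rocket" in vec.vocabulary_)
print(test_matrix.nnz)  # only "the" and "engine" are known words
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the resulting matrix would no longer match the columns the model was trained on.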

Let's predict with our new model on the test set

In [29]:
lr.predict(X_test_tr)

array(['talk.politics.mideast', 'rec.autos', 'comp.sys.ibm.pc.hardware',
       ..., 'rec.sport.baseball', 'rec.sport.baseball',
       'rec.sport.hockey'], dtype=object)

In [33]:
probas = lr.predict_proba(X_test_tr)

In [35]:
probas[0]

array([0.01801555, 0.00782297, 0.01153588, 0.01228335, 0.01340255,
       0.01076831, 0.00557847, 0.01700929, 0.02166862, 0.02220801,
       0.01911913, 0.02039678, 0.01396264, 0.0125997 , 0.02397564,
       0.01358093, 0.0231631 , 0.70432912, 0.01397986, 0.01460009])

In [38]:
lr.classes_

array(['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
       'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
       'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles',
       'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt',
       'sci.electronics', 'sci.med', 'sci.space',
       'soc.religion.christian', 'talk.politics.guns',
       'talk.politics.mideast', 'talk.politics.misc',
       'talk.religion.misc'], dtype=object)
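
The columns of `predict_proba` follow the order of `lr.classes_`, so the highest probability above (index 17) corresponds to `talk.politics.mideast`. As a sketch of that mapping, here is a tiny two-class model trained on made-up sentences:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["engine car wheel", "car road engine", "orbit rocket moon", "moon rocket launch"]
labels = ["rec.autos", "rec.autos", "sci.space", "sci.space"]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(docs), labels)

new_text = vec.transform(["rocket to the moon"])
toy_probas = clf.predict_proba(new_text)

# Column i of predict_proba is the probability of clf.classes_[i]
best = clf.classes_[np.argmax(toy_probas, axis=1)]
print(dict(zip(clf.classes_, toy_probas[0].round(3))))
print(best, clf.predict(new_text))
```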

In [37]:
print(X_test.iloc[0].text)



	One of these days you'll learn that the way to stop Israel
from fighting back is to stop attacking.  If there were no attacks in
the security zone for a year because the Lebanese army could maintain
the peace, then Lebanon would be in much better shape.

	Tell me something, though.  Why do Syrian troops not get
attacked?  Aren't they occupying Lebanon?

	Israel has repeatedly stated that it will leave on two
conditions.  One is a demonstration that the Lebanese army can keep
the peace.  The second is that the Syrians pull out as well.

Adam
Adam Shostack 				       adam@das.harvard.edu


Ouch! There is a big gap between the "train" accuracy (about 0.91) and the "test" accuracy (about 0.72). This suggests the model is **overfitting**.

### **Complete `TfidfVectorizer`**

Now we're going to transform text into numbers while including some other preprocessing methods typically used in NLP. Fortunately, the `TfidfVectorizer` class in `scikit-learn` provides us with a lot of useful options for that.

```python
TfidfVectorizer(
    *,
    input='content',
    encoding='utf-8',
    decode_error='strict',
    strip_accents=None,
    lowercase=True,                       # automatically transform all text to lowercase
    preprocessor=None,
    tokenizer=None,                       # this controls how tokens (words) are extracted. By default text is split with "token_pattern"
    analyzer='word',
    stop_words=None,                      # this allows us to specify stopwords to remove
    token_pattern='(?u)\\b\\w\\w+\\b',
    ngram_range=(1, 1),                   # this allows us to automatically generate n-grams
    max_df=1.0,                           # this controls the maximum "document freq." of a word to be included in the vocabulary
    min_df=1,                             # this controls the minimum "document freq." of a word to be included in the vocabulary
    max_features=None,                    # to limit the number of columns we have in the resulting matrix after transformation
    vocabulary=None,                      # to specify directly a vocabulary instead of being extracted from all words in text
    binary=False,
    dtype=<class 'numpy.float64'>,
    norm='l2',
    use_idf=True,
    smooth_idf=True,
    sublinear_tf=False,
)
```

In this example we're going to include additional preprocessing techniques for the text
1. Remove *stopwords* for the English language
2. Include bigrams, that is n-grams that are the combination of 2 words
3. Limit the minimum document frequency

First let's play with the stopwords

In [61]:
tfidf = TfidfVectorizer(
    stop_words="english",
)

In [62]:
X_train_tr = tfidf.fit_transform(X_train.text)
X_train_tr.shape

(7919, 83145)

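With the same training split, removing the stopwords makes the vocabulary roughly 300 words smaller. The count printed above is larger than the earlier 78933, which suggests the cell outputs come from different runs of the unseeded `train_test_split`, so the two numbers are not directly comparable.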

Now, let's include the bigrams

In [64]:
tfidf = TfidfVectorizer(
    stop_words="english",
    ngram_range=(1,2)     # this includes unigrams and bigrams
)

X_train_tr = tfidf.fit_transform(X_train.text)
X_train_tr.shape

(7919, 713894)

The vocabulary is incredibly huge because we've incorporated combinations of two consecutive words (bigrams)

Finally, let's set the minimum document frequency a token needs in order to be included in the vocabulary, for example 5.

In [91]:
tfidf = TfidfVectorizer(
    stop_words="english",
    ngram_range=(1,2),     # this includes unigrams and bigrams
    min_df=5
)

X_train_tr = tfidf.fit_transform(X_train.text)
X_train_tr.shape

(7919, 20344)

We've reduced the vocabulary a lot, because now a token must appear in at least 5 documents of the corpus to be kept.

Train a model with this new dataset

In [92]:
lr = LogisticRegression()
lr.fit(X_train_tr, y_train)  # we can use directly the sparse matrix "X_train_tr" in the LR model

In [93]:
lr.score(X_train_tr, y_train)

0.9181714862987751

In [94]:
# First, transform test data into numbers with Tf-Idf
X_test_tr = tfidf.transform(X_test.text) # Be careful! Here we use "transform", not "fit_transform"

In [95]:
lr.score(X_test_tr, y_test)

0.7240058910162003

**Conclusion**
- We've reduced the data complexity by reducing the number of columns after the TfIdf transformation
- Even with this change, the results are still similar, what means that the columns (information) removed wasn't relevant. Still we have a lot of overfitting in training, so this model is not good.
- What can we do to avoid overfitting and improve results? We can take several extra steps
   - Further preprocessing on text - For now we've just reduced the number of tokens by removing stopwords and limiting the number of occurrences of each token in the corpus. Also we have included bigrams. What else can we do?:
       - Include stemming and lemmatization to keep standardized versions of the words (ex: keeping just the lemma). This will reduce the number of words, while keeping the amount of information.
       - Use more advanced language models to transform text into numbers
       - Top libraries for NLP: NLTK (https://www.nltk.org/), spaCy (https://spacy.io/), Gensim (https://radimrehurek.com/gensim/)
   - Use regularization to reduce overfitting in the LogisticRegression
   - Use a more advanced, nonlinear model like Random Forests
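
As a sketch of the regularization idea: in `LogisticRegression` the parameter `C` is the inverse of the regularization strength, so smaller values shrink the coefficients more. The example below uses a tiny made-up corpus and three arbitrary values of `C`. It only illustrates the effect on the coefficients and says nothing about which value would work best on the newsgroups data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "engine car wheel road", "car dealer engine oil", "wheel tire road car",
    "orbit rocket moon launch", "moon shuttle rocket orbit", "launch satellite orbit",
]
labels = ["rec.autos"] * 3 + ["sci.space"] * 3

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # Vectorizer and classifier are fitted together as one estimator
    pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(C=C, max_iter=1000))
    pipe.fit(docs, labels)
    coef_norms[C] = float(np.linalg.norm(pipe[-1].coef_))

print(coef_norms)
```

In practice, `C` would be chosen by cross-validation on the training set (for example with `GridSearchCV`) rather than by looking at the test score.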