# Day 89 - random forest, grid search & CountVectorizer

1. The following arrays are given: <br>
<br>
X_train, y_train <br>
X_test, y_test <br>
<br>
Using the RandomForestClassifier class from the scikit-learn package, create a classification model (set random_state=42). Train the model on the training set and evaluate it on the test set.

In [1]:
import numpy as np
 
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
 
np.random.seed(42)
raw_data = make_moons(n_samples=2000, noise=0.25, random_state=42)
data = raw_data[0]
target = raw_data[1]
 
X_train, X_test, y_train, y_test = train_test_split(data, target)
 
classifier = RandomForestClassifier(random_state=42)
classifier.fit(X_train, y_train)
acc = classifier.score(X_test, y_test)
print(f'Accuracy: {acc:.4f}')

Accuracy: 0.9300
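
For a classifier, `score` returns the mean accuracy on the data it is given. The sketch below is an illustration rather than part of the exercise solution: it reproduces that number with `accuracy_score` and also prints a confusion matrix. The names `model`, `y_pred`, `manual_acc` and `cm` are made up for this example, and passing `random_state=42` to `train_test_split` is an assumption (the cell above relies on `np.random.seed(42)` instead), so the printed accuracy may differ from the output above.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Same synthetic data as in the exercise
X, y = make_moons(n_samples=2000, noise=0.25, random_state=42)

# Assumption: fix the split explicitly instead of relying on the global NumPy seed
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict labels for the test set and compare them with the true labels
y_pred = model.predict(X_test)
manual_acc = accuracy_score(y_test, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)

print(f'Accuracy: {manual_acc:.4f}')
print(cm)
```

The diagonal of the confusion matrix counts the correctly classified samples, so its sum divided by the number of test samples equals the accuracy.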


2. The following arrays are given: <br>
<br>
X_train, y_train <br>
X_test, y_test <br>
<br>
Using the RandomForestClassifier class and the grid search method (GridSearchCV class, with scoring='accuracy' and cv=5), find the optimal values of the criterion, max_depth and min_samples_leaf parameters. Search over the following values: <br>
for criterion -> ['gini', 'entropy'] <br>
for max_depth -> [6, 7, 8] <br>
for min_samples_leaf -> [4, 5] <br>
Train the model on the training set and evaluate it on the test set.

In [2]:
import numpy as np
 
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
 
np.random.seed(42)
raw_data = make_moons(n_samples=2000, noise=0.25, random_state=42)
data = raw_data[0]
target = raw_data[1]
 
X_train, X_test, y_train, y_test = train_test_split(data, target)
 
classifier = RandomForestClassifier(random_state=42)
 
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [6, 7, 8],
    'min_samples_leaf': [4, 5],
}
 
grid_search = GridSearchCV(
    classifier,
    param_grid=param_grid,
    n_jobs=-1,
    scoring='accuracy',
    cv=5,  # the exercise asks for 5-fold cross-validation
)
grid_search.fit(X_train, y_train)
grid_search.score(X_test, y_test)
print(grid_search.best_params_)

{'criterion': 'gini', 'max_depth': 8, 'min_samples_leaf': 4}
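
The cell above prints only `best_params_`; the test-set score is computed but not displayed. A minimal sketch of how the search result could be inspected follows. It assumes `cv=5` as stated in the task and an explicit `random_state=42` for the split (the cell above relies on `np.random.seed(42)` instead), so the selected parameters may differ from the output above. The variable names `n_candidates` and `test_acc` are made up for this example.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_moons(n_samples=2000, noise=0.25, random_state=42)

# Assumption: fix the split explicitly instead of relying on the global NumPy seed
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [6, 7, 8],
    'min_samples_leaf': [4, 5],
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring='accuracy',
    cv=5,
)
grid_search.fit(X_train, y_train)

# 2 criteria * 3 depths * 2 leaf sizes = 12 candidate settings
n_candidates = len(grid_search.cv_results_['params'])

# With refit=True (the default), score() uses the best model refit on the whole train set
test_acc = grid_search.score(X_test, y_test)

print(f'Candidates evaluated: {n_candidates}')
print(f'Best params: {grid_search.best_params_}')
print(f'Mean CV accuracy of best params: {grid_search.best_score_:.4f}')
print(f'Test accuracy: {test_acc:.4f}')
```

`best_score_` is the mean cross-validated accuracy on the training folds, while `test_acc` is measured on data the search never saw, so the two numbers need not match.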


3. The following list with text documents is given: <br>
<br>
documents = [ <br>
    'python is a programming language', <br>
    'python is popular', <br>
    'programming in python', <br>
    'object-oriented programming in python' <br>
] <br>
<br>
Vectorize the documents with the CountVectorizer class from scikit-learn.

In [3]:
import pandas as pd
 
from sklearn.feature_extraction.text import CountVectorizer
 
documents = [
    'python is a programming language',
    'python is popular',
    'programming in python',
    'object-oriented programming in python',
]
 
vectorizer = CountVectorizer()
 
df = pd.DataFrame(
    data=vectorizer.fit_transform(documents).toarray(),
    columns=vectorizer.get_feature_names_out(),  # get_feature_names() was removed in scikit-learn 1.2
)
print(df)



   in  is  language  object  oriented  popular  programming  python
0   0   1         1       0         0        0            1       1
1   0   1         0       0         0        1            0       1
2   1   0         0       0         0        0            1       1
3   1   0         0       1         1        0            1       1
