### Libraries

In [71]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import numpy as np
# import matplotlib.pyplot as plt
import itertools
%matplotlib inline

We are going to build a news classifier that assigns each article to one of 20 news categories. Note that the pipeline you build in this exercise could be of great help during your project if you plan to work with text!

1. Load the 20newsgroup dataset. It is, again, a classic dataset that can directly be loaded using sklearn ([link](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html)).  
[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), short for term frequency–inverse document frequency, is of great help when it comes to computing textual features. It gives more weight to terms that occur often in a given article (TF) and less weight to terms that are frequent across the entire corpus (IDF). Compute TF-IDF features for every article using [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Then, split your dataset into a training, a testing and a validation set (10% for validation and 10% for testing). Each observation should be paired with its corresponding label (the article category).


2. Train a random forest on your training set. Try to fine-tune the parameters of your predictor on your validation set using a simple grid search on the number of estimators "n_estimators" and the max depth of the trees "max_depth". Then, display a confusion matrix of your classification pipeline. Lastly, once you have assessed your model, inspect the `feature_importances_` attribute of your random forest and discuss the obtained results.
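
To make the TF-IDF weighting described in step 1 concrete, here is a minimal sketch on a toy corpus. The three sentences are made up for illustration and are not taken from the 20 newsgroups data. It shows that a word present in every document ("the") receives a lower IDF weight than a word present in only one ("hockey").

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, not from 20 newsgroups)
docs = [
    "the rocket launch was delayed",
    "the hockey game was delayed",
    "the rocket engine failed",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

vocab = vectorizer.vocabulary_  # maps each term to its column index
idf = vectorizer.idf_           # one IDF weight per column

# "the" occurs in all three documents, "hockey" in only one
the_idf = idf[vocab["the"]]
hockey_idf = idf[vocab["hockey"]]
print("idf(the) =", the_idf, " idf(hockey) =", hockey_idf)

# Within the first document, the rarer term "launch" outweighs "the"
first_doc = tfidf[0].toarray().ravel()
print(first_doc[vocab["launch"]] > first_doc[vocab["the"]])
```

With the default settings each row is also L2-normalized, so the weights are comparable across documents of different lengths.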



In [37]:
newsgroups_train = fetch_20newsgroups(subset='all')

In [38]:
# vectorize the dataset (with tfidf)
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups_train.data)

# split into 10% validation set, 10% test set and 80% training set

vectors_train, vectors_test, target_train, target_test = train_test_split(vectors, newsgroups_train.target, test_size=0.2, random_state=1)
vectors_test, vectors_val, target_test, target_val = train_test_split(vectors_test, target_test, test_size=0.5, random_state=1)
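
The two-step split can be checked on stand-in data before running it on the full TF-IDF matrix. The sketch below wraps it in a hypothetical helper, `split_80_10_10`, and adds `stratify`, which the cell above does not use, so that every class keeps the same proportion in each subset. The integer arrays are synthetic placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_80_10_10(X, y, seed=1):
    """Hypothetical helper: return (train, val, test) pairs of (X, y)."""
    # First hold out 20% of the data
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    # Then cut the held-out part in half: 10% test, 10% validation
    X_test, X_val, y_test, y_val = train_test_split(
        X_hold, y_hold, test_size=0.5, random_state=seed, stratify=y_hold)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Synthetic stand-in: 1000 samples, 20 balanced classes
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 20

(train_X, train_y), (val_X, val_y), (test_X, test_y) = split_80_10_10(X, y)
print(len(train_X), len(val_X), len(test_X))  # 800 100 100
```

`train_test_split` accepts scipy sparse matrices as well, so the same helper should work on the output of `TfidfVectorizer`.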


### Applied machine learning

In [65]:
# clf = MultinomialNB(alpha=.01)
# clf.fit(vectors_train, target_train)
# pred = clf.predict(vectors_test)
# metrics.f1_score(target_test, pred, average='macro')

0.90094518403431878

In [63]:
# def plot_confusion_matrix(cm, classes,
#                           normalize=False,
#                           title='Confusion matrix',
#                           cmap=plt.cm.Blues):
#     """
#     This function prints and plots the confusion matrix.
#     Normalization can be applied by setting `normalize=True`.
#     """
#     if normalize:
#         cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
#         print("Normalized confusion matrix")
#     else:
#         print('Confusion matrix, without normalization')

#     print(cm)

#     plt.imshow(cm, interpolation='nearest', cmap=cmap)
#     plt.title(title)
#     plt.colorbar()
#     tick_marks = np.arange(len(classes))
#     plt.xticks(tick_marks, classes, rotation=45)
#     plt.yticks(tick_marks, classes)

#     fmt = '.2f' if normalize else 'd'
#     thresh = cm.max() / 2.
#     for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
#         plt.text(j, i, format(cm[i, j], fmt),
#                  horizontalalignment="center",
#                  color="white" if cm[i, j] > thresh else "black")

#     plt.tight_layout()
#     plt.ylabel('True label')
#     plt.xlabel('Predicted label')

# cm = confusion_matrix(target_test, pred)
# plot_confusion_matrix(cm, newsgroups_train.target_names, True)

### Random Forest
#### four hyperparameters
1. the number of trees in the forest (n_estimators).
2. the number of features to consider at each split (max_features). By default: the square root of the total number of features.
3. the maximum depth of a tree, i.e. the length of the longest path from the root to a leaf (max_depth).
4. the minimum number of samples required at a leaf node, i.e. at the bottom of a tree (min_samples_leaf).

As the exercise requires, we grid-search only over n_estimators and max_depth.
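
The exercise asks for tuning on the validation set, whereas `GridSearchCV` by default scores each parameter combination by k-fold cross-validation on whatever data is passed to `fit`. One possible way to make it use a fixed validation set is `PredefinedSplit`. The sketch below is an assumption about how this could be wired up, not what the notebook does. It uses synthetic data from `make_classification` in place of the TF-IDF vectors, and a deliberately tiny parameter grid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split

# Synthetic stand-in data (assumption: replaces the TF-IDF vectors)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# -1 marks rows that always stay in training; 0 marks the single validation fold
test_fold = np.concatenate([np.full(len(X_train), -1),
                            np.zeros(len(X_val), dtype=int)])
split = PredefinedSplit(test_fold)

param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=split)

# Fit on train + validation stacked in the same order as test_fold
search.fit(np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))
print(search.best_params_, search.best_score_)
```

For sparse TF-IDF matrices, `scipy.sparse.vstack` would take the place of `np.vstack`. With the default `refit=True`, the best model is refit on the training and validation rows together.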

In [89]:
n_estimators = list(range(35, 45))
max_depth = list(range(35, 45))

parameters = {'n_estimators': n_estimators, 'max_depth': max_depth}

clf = RandomForestClassifier()
clf = GridSearchCV(clf, parameters)
clf.fit(vectors_train, target_train)
# clf.fit(vectors_train, target_train)
# pred = clf.predict(vectors_test)
# clf.score(vectors_test, target_test)
# metrics.f1_score(target_test, pred, average='macro')

GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [35, 36, 37, 38, 39, 40, 41, 42, 43, 44], 'max_depth': [35, 36, 37, 38, 39, 40, 41, 42, 43, 44]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [90]:
clf.best_params_

{'max_depth': 44, 'n_estimators': 43}

In [91]:
clf.best_score_

0.76505704430883525