# Organize ML projects with Scikit-Learn

Machine learning is powerful, but people often overestimate it, as if applying it to a project would solve every problem. In reality, it is not that simple: an ML project only pays off when the work is well organized. In this notebook, we will walk through the practical aspects of an ML project. To see the big picture, let's start with the checklist below. It should work reasonably well for most ML projects, but make sure to adapt it to your needs:

1. **Define the scope of work and objective**
    * How will your solution be used?
    * How should performance be measured? Are there any constraints?
    * How would the problem be solved manually?
    * List the assumptions you are making, and verify them if possible.
    
    
2. **Get the data**
    * Document where you can get the data
    * Store the data in a workspace you can easily access
    * Convert the data to a format you can easily manipulate
    * Check the overview (size, type, sample, description, statistics)
    * Clean the data
    
    
3. **EDA & Data transformation**
    * Study each attribute and its characteristics (missing values, type of distribution, usefulness)
    * Visualize the data
    * Study the correlations between attributes
    * Select, engineer, and scale features
    * Write functions for all data transformations
    
    
4. **Train models**
    * Automate as much as possible
    * Train promising models quickly using standard parameters. Measure and compare their performance
    * Analyze the errors the models make
    * Shortlist the top three to five most promising models, preferring models that make different types of errors.


5. **Fine-tuning**
    * Treat data transformation choices as hyperparameters, especially when you are not sure about them (e.g., replace missing values with zeros or with the median value)
    * Unless there are very few hyperparameter values to explore, prefer random search over grid search.
    * Try ensemble methods
    * Test your final model on the test set to estimate the generalization error. Don't tweak your model afterwards, or you will start overfitting the test set.
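
The first two fine-tuning points can be combined in one search. The snippet below is a minimal sketch, not part of the example that follows: it assumes synthetic numeric data with missing values, and the step names `imputer` and `clf` are made up for illustration. The imputation strategy (median versus a constant zero) is sampled by random search alongside the regularization strength `C`.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data with roughly 10% of the values knocked out (illustrative only)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.1] = np.nan

pipe = Pipeline([
    ('imputer', SimpleImputer(fill_value=0)),   # fill_value is only used by strategy='constant'
    ('clf', LogisticRegression(max_iter=1000)),
])

# The data transformation choice is searched like any other hyperparameter
param_distributions = {
    'imputer__strategy': ['median', 'constant'],
    'clf__C': loguniform(1e-2, 1e2),
}

search = RandomizedSearchCV(pipe, param_distributions,
                            n_iter=10, cv=3, random_state=42)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Because the search runs with cross-validation on the training data, the test set stays untouched until the final evaluation.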

## Example: Article categorization

### Objectives

Build a model that assigns each BBC news article to one of five categories (tech, business, sport, entertainment, politics) based on its text.

### Get Data

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

sns.set_style("whitegrid")

In [2]:
bbc = pd.read_csv('https://raw.githubusercontent.com/dhminh1024/practice_datasets/master/bbc-text.csv')

In [3]:
bbc.sample(5)

      category                                               text
1233     sport  fear will help france - laporte france coach b...
1750     sport  connors rallying cry for british tennis do y...
59    business  us regulator to rule on pain drug us food and ...
759   business  gm pays $2bn to evade fiat buyout general moto...
318   business  uk bank seals south korean deal uk-based bank ...


In [4]:
bbc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   category  2225 non-null   object
 1   text      2225 non-null   object
dtypes: object(2)
memory usage: 34.9+ KB


In [6]:
bbc.category.unique()

array(['tech', 'business', 'sport', 'entertainment', 'politics'],
      dtype=object)

In [15]:
text = bbc.text
category = bbc.category

In [11]:
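import nltk
from nltk.corpus import stopwords

# The stop word list must be downloaded once before it can be loaded
nltk.download('stopwords', quiet=True)
stop_words = stopwords.words('english')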

In [12]:
import re 

def preprocessor(text):
    """Return a cleaned version of text."""
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Save emoticons for later appending
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase the text, replace runs of non-word characters with a space,
    # then append the emoticons with the nose character removed for standardization
    text = re.sub(r'\W+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', '')
    
    return text
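
To see what each step of the preprocessor does, the sketch below applies the same three regular expressions one at a time. The sample string is made up for illustration and does not come from the BBC data.

```python
import re

sample = "<p>Great match :-) Really!</p>"  # hypothetical input

# Step 1: strip HTML tags
no_html = re.sub(r'<[^>]*>', '', sample)
# Step 2: collect emoticons such as :-) ;) =D before punctuation is removed
emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_html)
# Step 3: lowercase and collapse every run of non-word characters into one space
words = re.sub(r'\W+', ' ', no_html.lower())
# Step 4: append the emoticons without the "nose" character
cleaned = words + ' ' + ' '.join(emoticons).replace('-', '')

print(repr(no_html))   # 'Great match :-) Really!'
print(emoticons)       # [':-)']
print(cleaned.split()) # ['great', 'match', 'really', ':)']
```

For news articles the emoticon step rarely fires, so it is mostly harmless here; it matters more for informal text such as reviews.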

In [13]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Split a text into a list of words
def tokenizer(text):
    return text.split()

# Split a text into a list of words and reduce each word to its stem (Porter stemming)
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(text, category, test_size=0.3, random_state=42)
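
The split above is purely random. When class proportions matter, `train_test_split` also accepts a `stratify` argument that keeps the class ratio the same in both parts. A minimal sketch on hypothetical imbalanced labels (the names `X_toy` and `y_toy` are made up for illustration):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 samples of class 'a', 30 of class 'b'
X_toy = list(range(100))
y_toy = ['a'] * 70 + ['b'] * 30

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy)

# The 70/30 ratio is preserved in the 30-sample test set
test_counts = Counter(y_te)
print(test_counts)  # Counter({'a': 21, 'b': 9})
```

For the BBC data this would correspond to passing `stratify=category`; whether it changes the results noticeably has not been checked here.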

In [37]:
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words vectorizer: fitting builds the vocabulary from the articles,
# transforming turns each article into a vector of word counts
count = CountVectorizer(stop_words=stop_words,
                        tokenizer=tokenizer_porter,
                        preprocessor=preprocessor)

In [49]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

tfidf = TfidfVectorizer(stop_words=stop_words,
                        tokenizer=tokenizer_porter,
                        preprocessor=preprocessor)

from sklearn.base import clone

# Give every pipeline its own copy of the vectorizer so that fitting one
# pipeline never overwrites the fitted state used by another
clf_logistic_tfidf = Pipeline([('vect', clone(tfidf)), ('clf', LogisticRegression(random_state=42))])
clf_dtree_tfidf = Pipeline([('vect', clone(tfidf)), ('clf', DecisionTreeClassifier())])
clf_rforest_tfidf = Pipeline([('vect', clone(tfidf)), ('clf', RandomForestClassifier())])
clf_logistic_count = Pipeline([('vect', clone(count)), ('clf', LogisticRegression(random_state=42))])
clf_dtree_count = Pipeline([('vect', clone(count)), ('clf', DecisionTreeClassifier())])
clf_rforest_count = Pipeline([('vect', clone(count)), ('clf', RandomForestClassifier())])
clf_nb = Pipeline([('vect', clone(count)), ('clf', MultinomialNB())])

clfs = {
    'LogisticRegression tfidf': clf_logistic_tfidf,
    'DecisionTree tfidf': clf_dtree_tfidf,
    'RandomForest tfidf': clf_rforest_tfidf,
    'LogisticRegression count': clf_logistic_count,
    'DecisionTree count': clf_dtree_count,
    'RandomForest count': clf_rforest_count,
    'NaiveBayes': clf_nb
}

# Fit every pipeline on the training set
for name, clf in clfs.items():
    clf.fit(X_train, y_train)

In [50]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Evaluate every fitted pipeline on the held-out test set
for name, clf in clfs.items():
    predictions = clf.predict(X_test)
    print(name)
    print('accuracy:', accuracy_score(y_test, predictions))
    # Uncomment for a per-class breakdown:
    # print('confusion matrix:\n', confusion_matrix(y_test, predictions))
    # print('classification report:\n', classification_report(y_test, predictions))

LogisticRegression tfidf
accuracy: 0.9760479041916168
DecisionTree tfidf
accuracy: 0.8308383233532934
RandomForest tfidf
accuracy: 0.9431137724550899
LogisticRegression count
accuracy: 0.9640718562874252
DecisionTree count
accuracy: 0.8188622754491018
RandomForest count
accuracy: 0.9491017964071856
NaiveBayes
accuracy: 0.9745508982035929
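
The comparison above scores every model on the test set, which the checklist reserves for the final estimate. One alternative is to compare models with cross-validation on the training data only. The sketch below shows the idea on a tiny made-up corpus so that it runs on its own; in the notebook, `texts` and `labels` would be `X_train` and `y_train`, and the pipeline would be one of those defined earlier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy stand-in corpus (hypothetical), two classes with six documents each
texts = [
    "team wins the cup final", "striker scores late goal",
    "coach praises the squad", "club signs new striker",
    "keeper saves the penalty", "team loses away match",
    "shares fall as profits drop", "bank raises interest rates",
    "firm reports record profits", "market rallies on earnings",
    "company cuts jobs and costs", "investors buy bank shares",
]
labels = ["sport"] * 6 + ["business"] * 6

pipe = Pipeline([('vect', TfidfVectorizer()),
                 ('clf', LogisticRegression(random_state=42))])

# 3-fold stratified cross-validation; the vectorizer is refit inside each fold
scores = cross_val_score(pipe, texts, labels, cv=3, scoring='accuracy')
print('mean accuracy: %.3f (std %.3f)' % (scores.mean(), scores.std()))
```

Keeping the vectorizer inside the pipeline matters here: the vocabulary is then learned from the training folds only, so nothing leaks from the validation fold.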


In [51]:
from sklearn.ensemble import VotingClassifier

# Combine three of the strongest pipelines; VotingClassifier uses
# hard (majority) voting by default
base_classifiers = [('LogisticRegression tfidf', clf_logistic_tfidf),
                    ('RandomForest tfidf', clf_rforest_tfidf),
                    ('NaiveBayes', clf_nb)]

ensembles = {
    "Voting": VotingClassifier(base_classifiers)
}

for name, ensemble in ensembles.items():
    ensemble.fit(X_train, y_train)
    predictions = ensemble.predict(X_test)
    print(name)
    print('accuracy:', accuracy_score(y_test, predictions))
    print('confusion matrix:\n', confusion_matrix(y_test, predictions))
    print('classification report:\n', classification_report(y_test, predictions))

Voting
accuracy: 0.9730538922155688
confusion matrix:
 [[158   0   5   0   1]
 [  4 107   0   0   2]
 [  2   0 111   0   0]
 [  0   0   0 146   0]
 [  3   0   0   1 128]]
classification report:
                precision    recall  f1-score   support

     business       0.95      0.96      0.95       164
entertainment       1.00      0.95      0.97       113
     politics       0.96      0.98      0.97       113
        sport       0.99      1.00      1.00       146
         tech       0.98      0.97      0.97       132

     accuracy                           0.97       668
    macro avg       0.97      0.97      0.97       668
 weighted avg       0.97      0.97      0.97       668
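
The checklist asks to analyze the errors the models make. A short sketch of one way to do this from the confusion matrix printed above: the numbers are copied from that output, and the label order is the alphabetical order used in the classification report (rows are true categories, columns are predicted ones).

```python
import numpy as np

labels = ['business', 'entertainment', 'politics', 'sport', 'tech']
# Confusion matrix copied from the Voting output (rows: true, columns: predicted)
cm = np.array([[158,   0,   5,   0,   1],
               [  4, 107,   0,   0,   2],
               [  2,   0, 111,   0,   0],
               [  0,   0,   0, 146,   0],
               [  3,   0,   0,   1, 128]])

# Keep only the off-diagonal entries, i.e. the misclassifications
errors = cm.copy()
np.fill_diagonal(errors, 0)

# Rank (true, predicted) pairs by how often they occur
pairs = [(labels[i], labels[j], int(errors[i, j]))
         for i in range(len(labels))
         for j in range(len(labels))
         if errors[i, j] > 0]
pairs.sort(key=lambda p: p[2], reverse=True)

for true_label, predicted, count in pairs[:3]:
    print(f"{true_label} -> {predicted}: {count}")
# business -> politics: 5
# entertainment -> business: 4
# tech -> business: 3

accuracy = np.trace(cm) / cm.sum()  # 650 / 668, matches the reported accuracy
```

Of the 18 errors, 9 are articles from other categories predicted as business, and 6 more are business articles predicted as something else, so the business class is where most of the confusion sits.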

