# Machine Learning and Modeling
For this section on modeling and machine learning, we'll focus on scikit-learn as the main library. The [scikit-learn documentation](http://scikit-learn.org/stable/) is very thorough and a good source of general information about different models and applications as well as specific details about their implementation, requirements and syntax.

### Model examples:
1. Linear regression
2. Logistic regression
3. Vocabulary-based classifier, à la Twitter topic models

### Modeling steps: 
1. Data collection
1. Data QA and cleaning
1. Feature engineering and extraction
1. Split data into train/test
1. Model selection
1. Model tuning (repeat as necessary)
1. Model application
1. Regularly review and re-fit models if/when they deteriorate.
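
As a rough illustration of how the split, selection, tuning, and application steps above fit together, here is a minimal sketch on synthetic data. The data, the `Ridge` model, and the `alpha` grid are all assumptions chosen for illustration, not part of the exercises below.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for collected, cleaned data (assumption: 200 rows, 3 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Split data into train/test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Model selection and tuning: search over the regularization strength
pipeline = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(pipeline, {"ridge__alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_tr, y_tr)

# Model application: score the tuned model (R^2) on held-out data
test_r2 = search.score(X_te, y_te)
print(search.best_params_, round(test_r2, 3))
```

Wrapping the scaler and the model in one pipeline means the scaler is re-fit inside each cross-validation fold, so the held-out fold never influences the scaling.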

One thing to note about modeling in scikit-learn: feature matrices are expected to be numeric. This is directly from their [FAQ](http://scikit-learn.org/stable/faq.html):

*Generally, scikit-learn works on any numeric data stored as numpy arrays or scipy sparse matrices. Other types that are convertible to numeric arrays such as pandas DataFrame are also acceptable.*

There is a section in that same FAQ that explains how to deal with string data.

In [None]:
# import necessary libraries
import pandas as pd # for data frames, reading and writing data
import numpy as np
import re
from matplotlib import pyplot as plt
from sklearn import datasets, linear_model
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge, SGDClassifier, Lasso
from sklearn.feature_extraction import text
from nltk.stem import SnowballStemmer
from sklearn import preprocessing
from sklearn.metrics import confusion_matrix 
from sklearn.svm import SVC 

# the next line makes the matplotlib plots show up in the notebook cell
%matplotlib inline

## Load Data
Scikit-learn comes with some built-in datasets that we can use. Let's take a look at the "boston" dataset, which contains house prices along with features that could be predictive of them. The median home value is typically the target. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the call below only works with older versions. Calling `sklearn.datasets.load_boston()` returns a dictionary-like object with keys:
* data
* target
* feature_names
* DESCR (description)

To be more consistent with how we normally get data (reading from a dataset or table, rather than arrays), I've saved the `boston` data as a worksheet in our sample_data.xlsx workbook. The worksheet is called 'boston'.

In [None]:
boston = datasets.load_boston()

In [None]:
boston['feature_names']

Let's turn this data into a DataFrame to be more consistent with what we normally see when pulling data:

In [None]:
boston_df = pd.DataFrame(boston['data'], columns=boston['feature_names'])
boston_df['target'] = boston['target']
boston_df.head()

## Read data from Excel
This data frame is now saved in the sample_data.xlsx file that we've been using. Load the sheet 'boston'.

In [None]:
filename = 'sample_data.xlsx'
boston_df = pd.read_excel(filename, sheet_name='boston')
boston_df.head()

In [None]:
boston_df.describe()

### Response Variable
Let's take a quick look at the response variable:

In [None]:
# Create a histogram of the target variable
boston_df['target'].hist()

#### Log of Response Variable
Let's take a look at the log-response to see if it is potentially better for our modeling purposes.

In [None]:
# Create a histogram of the log of the target variable
np.log(boston_df['target']).hist()

The log-transform of the response variable has a distribution much closer to normal. We'll create a log-target variable to see if it gives us better results.

In [None]:
# Create a new variable - 'log_target' showing the log of the target
boston_df['log_target'] = np.log(boston_df['target'])
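
To put a number on "closer to normal", one option is to compare skewness before and after the log transform. This is only a sketch on synthetic log-normal values (an assumption standing in for a price-like target), not the Boston data itself.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a price-like target (assumption)
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=3.0, sigma=0.5, size=1000))

# Skewness near 0 indicates a roughly symmetric distribution
raw_skew = values.skew()
log_skew = np.log(values).skew()
print('Raw skew: {:.2f} \nLog skew: {:.2f}'.format(raw_skew, log_skew))
```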

## Splitting Data into Test and Train
Before fitting a predictive model, we'll split the data into train and test data sets. This can be done using the `sample` method from pandas, or the `train_test_split` function in scikit-learn.

In [None]:
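# test_size and random_state are illustrative choices
train_data, test_data = train_test_split(boston_df, test_size=0.2, random_state=42)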
print('Train length: {} \nTest Length: {}'.format(len(train_data), len(test_data)))
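
The two splitting options mentioned above can be sketched side by side. The small frame here is hypothetical, and the 80/20 split and seed are arbitrary choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical small frame standing in for the real data
df = pd.DataFrame({"x": range(10), "y": range(10, 20)})

# Option 1, pandas: sample 80% of rows for training; the rest become the test set
train_pd = df.sample(frac=0.8, random_state=0)
test_pd = df.drop(train_pd.index)

# Option 2, scikit-learn: one call returns both pieces
train_sk, test_sk = train_test_split(df, test_size=0.2, random_state=0)

print(len(train_pd), len(test_pd), len(train_sk), len(test_sk))
```

Either way, fixing the random seed makes the split reproducible between runs.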

## Principal Components Analysis
Let's start with a PCA of this data. 

Here's an example in [Scikit-Learn](http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_iris.html)

In [None]:
#Initiate the PCA model, setting the number of components to use
pca_unscaled = PCA(n_components=5)  # 5 components is an illustrative choice

#Fit the PCA using training data
pca_unscaled.fit(train_data.iloc[:,0:13])

# Use a dataframe for easier viewing:
pd.DataFrame(pca_unscaled.components_, columns=boston_df.columns[0:13])

In [None]:
X_scaler = StandardScaler()
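# Scale the training and test data (fit the scaler on the training data only)
X_train = X_scaler.fit_transform(train_data.iloc[:, 0:13])
X_test = X_scaler.transform(test_data.iloc[:, 0:13])

# Re-run the PCA using the scaled data
pca_scaled = PCA(n_components=5)  # 5 components is an illustrative choice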

#Re-Fit the model
pca_scaled.fit(X_train)

# View in a dataframe again:
pd.DataFrame(pca_scaled.components_, columns=boston_df.columns[0:13])

We can see how much of the variance is explained by each of the principal components by looking at the `explained_variance_ratio_` attribute.

In [None]:
list(pca_scaled.explained_variance_ratio_)

In [None]:
list(pca_unscaled.explained_variance_ratio_)
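
Why scaling changes the explained variance can be seen in a small sketch with two independent synthetic features on very different scales (both the data and the scales are assumptions for illustration).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two independent synthetic features on very different scales (assumption)
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(scale=1000.0, size=500),
                     rng.normal(scale=1.0, size=500)])

# Without scaling, the large-scale feature dominates the first component
ratio_unscaled = PCA(n_components=2).fit(X).explained_variance_ratio_

# After standardization, both features contribute comparably
X_std = StandardScaler().fit_transform(X)
ratio_scaled = PCA(n_components=2).fit(X_std).explained_variance_ratio_

print(ratio_unscaled, ratio_scaled)
```

PCA picks directions of maximum variance, so a feature measured in large units looks "important" purely because of its units unless the data is standardized first.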

## Linear Regression
Normally we'd next spend a good amount of time exploring the data for natural relationships, correlations, and so on. For our purposes here, however, let's skip directly to fitting a linear regression model using all of the features in our dataset. Conveniently, all of the data is already numeric, so no transformation is required other than standardization, which we've already done with the `StandardScaler`.

### Define Response Variable(s)

In [None]:
Y_train =  train_data['target']
Y_test = test_data['target']
Y_train_log = train_data['log_target']
Y_test_log = test_data['log_target']

### Fit Regression Model
We'll create a linear regression model using all of the features available for now. After fitting the model, we'll predict values for both the train and test data in order to evaluate the fit for both.

In [None]:
# Create linear regression object
regr1 = linear_model.LinearRegression()

# Train the model using the training sets
regr1.fit(X_train, Y_train)

# Make predictions - calling the 'predict' method on the model
Y_pred_train = regr1.predict(X_train)
Y_pred_test = regr1.predict(X_test)

# The coefficients
print('Coefficients: \n', regr1.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(Y_test, Y_pred_test))
# R^2 score (coefficient of determination): 1 is perfect prediction
print('Train R^2 score: %.2f' % r2_score(Y_train, Y_pred_train))
print('Test R^2 score: %.2f' % r2_score(Y_test, Y_pred_test))

### Fit Regression to log-response
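Let's run it again using the log-response as our target and compare results. Note that the mean squared error for this model is in log units, so it isn't directly comparable to the mean squared error of the first model.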

In [None]:
# Create linear regression object
regr2 = linear_model.LinearRegression()

# Train the model using the training sets
regr2.fit(X_train, Y_train_log)

# Make predictions 
Y_pred_log_train = regr2.predict(X_train)
Y_pred_log_test = regr2.predict(X_test)

# The coefficients
print('Coefficients: \n', regr2.coef_)
# The mean squared error (in log units)
print("Mean squared error: %.2f"
      % mean_squared_error(Y_test_log, Y_pred_log_test))
# R^2 score (coefficient of determination): 1 is perfect prediction
print('Train R^2 score: %.2f' % r2_score(Y_train_log, Y_pred_log_train))
print('Test R^2 score: %.2f' % r2_score(Y_test_log, Y_pred_log_test))
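
An error measured on the log scale is not in the same units as one measured on the original scale. One way to compare a raw-response model with a log-response model is to back-transform the log predictions with `np.exp` and compute both errors in the original units. The sketch below does this on synthetic data where the response is exponential in the feature (an assumption chosen so the log model fits well); it is not a result for the Boston data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data where the response is exponential in the feature (assumption)
rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(300, 1))
y = np.exp(1.0 + 1.0 * X[:, 0] + rng.normal(scale=0.05, size=300))

# Model A: fit the raw response directly
pred_raw = LinearRegression().fit(X, y).predict(X)

# Model B: fit the log response, then undo the log with np.exp
pred_from_log = np.exp(LinearRegression().fit(X, np.log(y)).predict(X))

# Both errors are now in the original units, so they can be compared
mse_raw = mean_squared_error(y, pred_raw)
mse_from_log = mean_squared_error(y, pred_from_log)
print('Raw-model MSE: {:.3f} \nLog-model MSE: {:.3f}'.format(mse_raw, mse_from_log))
```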

## Classifier Models
Next let's look at a classifier model using our classified life-event tweets. We'll create a classifier based on the vocabulary of the tweets. So far we've only created binary classifiers, since the tweet data has been gathered by tweet topic. However, since we have a dataset with multiple tweet categories classified, we'll create a multi-class classifier model.

### Load data

In [None]:
filename = 'sample_data.xlsx'
t_data = pd.read_excel(filename, sheet_name='tweets_classified')
t_data.head()

Before extracting the vocabulary from these texts, the data needs a little cleaning to remove some noise. Here's what we'll do:
1. Convert everything to lower case
1. Remove hyperlinks
1. Stemming - cutting words down to their roots

In [None]:
t_data['mod_text'] = [re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '',
                    str(x).lower()) for x in t_data['text']]
t_data.head()
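
The first two cleaning steps can be tried out on a single string. This sketch uses a simplified URL pattern (an assumption: anything starting with `http://` or `https://`) rather than the fuller regular expression above, and the helper name `clean_text` is hypothetical.

```python
import re

def clean_text(raw):
    """Lower-case a string, remove hyperlinks, and collapse leftover whitespace."""
    lowered = str(raw).lower()
    # Simplified URL pattern (assumption): anything starting with http:// or https://
    no_links = re.sub(r'https?://\S+', '', lowered)
    return ' '.join(no_links.split())

print(clean_text('Just got ENGAGED!! http://t.co/abc123 so happy'))
```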

## Split into train/test

In [None]:
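# test_size and random_state are illustrative choices
t_train, t_test = train_test_split(t_data, test_size=0.2, random_state=42)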
print('Train length: {} \nTest Length: {}'.format(len(t_train), len(t_test)))

### Word Count Vectorizer
Now that we've cleaned our data, we need to transform it into a vocabulary of words that becomes our feature matrix for scikit-learn. We'll use the `CountVectorizer` from scikit-learn along with the `SnowballStemmer` from `nltk` (the Natural Language Toolkit) to create our feature vectors.

In [None]:
# Initialize the "CountVectorizer" object, which is scikit-learn's bag of words tool.  
# vectorizer = CountVectorizer(min_df=10, stop_words=stop_words, ngram_range=(1,2))

# fit_transform() does two things: first, it fits the model and learns the vocabulary;
# second, it transforms our training data into feature vectors. 
# The input to fit_transform should be a list of strings.

# creating the custom, stemmed count vectorizer
english_stemmer = SnowballStemmer('english')
# newer scikit-learn versions require a list here rather than a frozenset
stop_words = list(text.ENGLISH_STOP_WORDS)
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # the default analyzer lower-cases, tokenizes and removes stop words;
        # we then stem each remaining token
        analyzer = super().build_analyzer()
        return lambda doc: [english_stemmer.stem(w) for w in analyzer(doc)]

vectorizer_s = StemmedCountVectorizer(min_df=5, analyzer="word", stop_words=stop_words)

### Extract Feature Matrices
Using the vectorizer that we just created, we can extract our feature matrices from the training and test data.

In [None]:
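# Create the feature vectors for all of the different data sets.
# Fit the vocabulary on the training text only, then reuse it for the test text.
train_data_features = vectorizer_s.fit_transform(t_train['mod_text'])
test_data_features = vectorizer_s.transform(t_test['mod_text'])
word_features = vectorizer_s.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0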

Let's see what words are used most frequently. Looking at the top 20 words:

In [None]:
word_counts = train_data_features.sum(axis = 0)
word_count_df = pd.DataFrame(word_features)
word_count_df['count'] = word_counts[0,:].tolist()[0]
word_count_df.columns = ['word','count']
word_count_df.sort_values(by = 'count', ascending = False).head(20)

## Response variable
We are trying to predict the topic that each tweet belongs to. Since we have multiple topics represented, this becomes a multi-class classification problem. Let's look at the value counts for our different topics.

In [1]:
t_data.topic.value_counts()

Scikit-learn classifiers such as `SVC` accept text labels directly, so the model below uses the `topic` column as is. If binary indicator variables are needed instead (one column per topic), scikit-learn's `LabelBinarizer` can create them for us.
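
A minimal sketch of what `LabelBinarizer` produces, using made-up topic labels (the real values come from the `topic` column):

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical topic labels; the real values come from the 'topic' column
topics = ['wedding', 'baby', 'job', 'baby']

binarizer = LabelBinarizer()
topic_matrix = binarizer.fit_transform(topics)

# classes_ gives the column order (sorted alphabetically)
print(binarizer.classes_)
print(topic_matrix)
```

Each row has a single 1 in the column of its topic; `binarizer.inverse_transform` maps such rows back to the text labels.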

### Classifier Model
We'll use a support vector machine (SVM) model to build our classifier for these tweets. The feature matrices are the inputs and the targets are the topic labels. [SVC Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

In [None]:
svm_model_linear = SVC(kernel = 'linear', C = 1).fit(train_data_features, t_train.topic) 
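train_predictions = svm_model_linear.predict(train_data_features)
test_predictions = svm_model_linear.predict(test_data_features)

# model accuracy on the train and test data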
train_accuracy = svm_model_linear.score(train_data_features, t_train.topic) 
test_accuracy = svm_model_linear.score(test_data_features, t_test.topic) 
print('Train accuracy: {:0.4f} \nTest accuracy: {:0.4f}'.format(train_accuracy, test_accuracy))
  
# creating a confusion matrix 
cm = confusion_matrix(t_test.topic, test_predictions) 
cm
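
Reading a multi-class confusion matrix is easier with a tiny example. In the sketch below the labels and predictions are made up; rows correspond to the true class and columns to the predicted class, in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six tweets
y_true = ['baby', 'baby', 'job', 'job', 'wedding', 'wedding']
y_pred = ['baby', 'job', 'job', 'job', 'wedding', 'baby']

# Fix the row/column order explicitly so the matrix is easy to read
labels = ['baby', 'job', 'wedding']
toy_cm = confusion_matrix(y_true, y_pred, labels=labels)
print(toy_cm)
```

The diagonal holds the correctly classified counts; the off-diagonal entries show which topics get confused with which. Comparing each row sum with the `value_counts` of the test labels is a quick sanity check.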

In [None]:
t_test.topic.value_counts()