## Final Project - Notes + Annotated Code

**Methodology followed:**

1. X - Vectorize the data using ngram_range to combine unigrams, bigrams and trigrams.
2. y - The sentiment labels: Negative, Neutral and Positive.
3. Make a train/test split from the X and y defined above.
    - Use 25% of the data for testing and 75% for training.
    - Set random_state so that the split is reproducible across runs.
4. Initialize the models (most models were initialized with default parameters).
5. Predict values with these models.
6. Calculate Metrics:
    - Accuracy
    - F1-Score
7. Compare models based on the metrics
8. Apply grid search to the best model.
9. Run cross_val_score from sklearn on the best model, using the parameters found by GridSearchCV.
10. Build and visualize the confusion matrix for the best model.
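
Steps 1 to 7 above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy corpus, the choice of `CountVectorizer` as the vectorizer, the two models compared, the macro-averaged F1 and the `stratify=y` argument are all assumptions made for the example.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: four short texts per sentiment class.
texts = [
    "i love this phone", "great battery and great screen",
    "really happy with the service", "best purchase this year",
    "this is a terrible product", "worst service i have had",
    "i hate the slow delivery", "very bad and very late",
    "the parcel arrived on monday", "it is a phone with a screen",
    "the store opens at nine", "delivery takes three days",
]
labels = ["Positive"] * 4 + ["Negative"] * 4 + ["Neutral"] * 4

# Step 1: X combines unigrams, bigrams and trigrams in one feature matrix.
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(texts)
# Step 2: y holds the sentiment labels.
y = labels

# Step 3: 75% train / 25% test, with a fixed random_state for reproducibility.
# stratify=y is an assumption that keeps every class present in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Step 4: initialize the models, mostly with default parameters.
models = {
    "LinearSVC": LinearSVC(),
    "ExtraTrees": ExtraTreesClassifier(random_state=42),
}

# Steps 5 to 7: predict, compute the metrics and collect them for comparison.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_macro": f1_score(y_test, y_pred, average="macro", zero_division=0),
    }

for name, metrics in results.items():
    print(name, metrics)
```

With a corpus this small the scores themselves mean little; the point is only the order of the steps.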

**Reason for using SVM:**

- It can be used where the other models do not perform as required:
    - In my case, there was some degree of class imbalance in my dataset.
- Other advantages I considered:
    - It doesn't assume a particular distribution of the data.
    - Ability to make use of non-linear kernels.
    - It is less sensitive to multicollinearity among the features.
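
As a rough sketch of how an SVM can be set up for imbalanced classes with a non-linear kernel: the synthetic data from `make_classification`, the RBF kernel and `class_weight="balanced"` are assumptions for this example, not settings taken from the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced two-class data (roughly 80% / 20%); purely illustrative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    weights=[0.8, 0.2], flip_y=0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Non-linear RBF kernel; "balanced" reweights classes inversely to their
# frequency in the training labels.
clf = SVC(kernel="rbf", class_weight="balanced")
clf.fit(X_train, y_train)

counts = np.bincount(y_train)   # training samples per class
print("samples per class:", counts)
print("per-class weights:", clf.class_weight_)

y_pred = clf.predict(X_test)
```

The fitted `class_weight_` attribute shows the multiplier applied to `C` for each class, so the rarer class is penalized more heavily when misclassified.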

**Reason for using Ensemble Classifiers:** My goal was to improve accuracy.

Some other advantages:

- ExtraTreesClassifier is an ensemble of extremely randomized versions of DecisionTreeClassifier.
- It helps to reduce the model's sensitivity to variance in the training data.
- Ensembles are more robust than a single tree.
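
One way to look at the variance point is to train a single tree and an ensemble on several random subsets of the same data and compare how much their test accuracy moves. The sketch below does this on synthetic data; the helper `accuracy_spread`, the data and all parameter values are hypothetical. One would usually expect the ensemble's spread to be smaller, but the sketch does not guarantee it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def accuracy_spread(make_model, n_rounds=10):
    """Train on random halves of the pool; return mean and std of test accuracy."""
    rng = np.random.RandomState(0)
    scores = []
    for _ in range(n_rounds):
        idx = rng.choice(len(X_pool), size=len(X_pool) // 2, replace=False)
        model = make_model().fit(X_pool[idx], y_pool[idx])
        scores.append(model.score(X_test, y_test))
    return float(np.mean(scores)), float(np.std(scores))


tree_mean, tree_std = accuracy_spread(
    lambda: DecisionTreeClassifier(random_state=0)
)
forest_mean, forest_std = accuracy_spread(
    lambda: ExtraTreesClassifier(n_estimators=100, random_state=0)
)

print(f"single tree : mean={tree_mean:.3f} std={tree_std:.3f}")
print(f"extra trees : mean={forest_mean:.3f} std={forest_std:.3f}")
```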

**Reason for using Cross Validation:**

I needed to estimate how well the model will perform on unseen test samples, i.e. it is used for model evaluation.

- It holds out part of the data before training begins, then uses the held-out data to test the model.
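
A minimal sketch of this idea, combined with the grid search and confusion matrix steps from the methodology, could look like the following. The synthetic data, the parameter grid, the fold counts and the `f1_macro` scoring are assumptions for illustration, not the values used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic three-class data standing in for the vectorized text.
X, y = make_classification(
    n_samples=150, n_features=10, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Grid search over a small, hypothetical parameter grid.
search = GridSearchCV(
    ExtraTreesClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [None, 5]},
    cv=3,
    scoring="f1_macro",
)
search.fit(X_train, y_train)

# Cross-validation with the parameters found by the grid search: each of the
# 5 folds is held out once and used only to score the model.
best = ExtraTreesClassifier(random_state=0, **search.best_params_)
scores = cross_val_score(best, X_train, y_train, cv=5)
print("best params:", search.best_params_)
print("fold accuracies:", scores)

# Confusion matrix on the untouched test split.
best.fit(X_train, y_train)
cm = confusion_matrix(y_test, best.predict(X_test), labels=[0, 1, 2])
print(cm)
```

Rows of the confusion matrix are the true classes and columns are the predicted classes, so the diagonal counts the correct predictions.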


**Annotating the ExtraTreesClassifier**

The ExtraTreesClassifier is a child class of ForestClassifier, which:

- Is a base class for forest of trees-based classifiers.
- Accepts parameters such as:
    - base_estimator - the tree-based classifier used to build each tree.
    - n_estimators - the number of trees.
    - estimator_params - the names of the parameters passed on to each tree.
    - bootstrap=False - whether to build the trees on bootstrap samples.
    - oob_score=False - whether to compute the out-of-bag score.
    - n_jobs=1 - number of parallel jobs to run.
    - random_state=None - if this is set, the random numbers are the same on every run.
    - verbose=0 - how verbose the tree-building process should be.
    - warm_start=False - if this is set, reuse the previous fit and add more trees to it.
    - class_weight=None - by default all classes have the same weight, but this can be changed through this parameter.
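
The effect of several of these parameters can be seen in a short sketch. The synthetic data and the specific values below are assumptions chosen for illustration; note that `oob_score=True` only works together with `bootstrap=True`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data, for illustration only.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

clf = ExtraTreesClassifier(
    n_estimators=50,          # number of trees
    bootstrap=True,           # train each tree on a bootstrap sample
    oob_score=True,           # score on the samples each tree did not see
    n_jobs=1,                 # one job, no parallelism
    random_state=0,           # same random numbers on every run
    verbose=0,                # silent tree building
    class_weight="balanced",  # reweight classes by inverse frequency
)
clf.fit(X, y)
print("trees:", len(clf.estimators_))
print("out-of-bag score:", clf.oob_score_)

# warm_start=True keeps the trees from the previous fit and only adds new ones.
grower = ExtraTreesClassifier(n_estimators=50, warm_start=True, random_state=0)
grower.fit(X, y)
grower.set_params(n_estimators=80)
grower.fit(X, y)              # adds 30 trees to the existing 50
print("trees after warm start:", len(grower.estimators_))
```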

In [3]:
## Fit and predict methods of ExtraTreesClassifier. Source: scikit-learn ensemble methods

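**Fit, fit_transform and transform** methods (as the super(RandomTreesEmbedding, self) call in the code shows, these three methods come from RandomTreesEmbedding, a related estimator in the same ensemble module)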

In [5]:
def fit(self, X, y=None, sample_weight=None):
    """Fit estimator.
    Parameters
    ----------
    X : array-like or sparse matrix, shape=(n_samples, n_features)
        The input samples. Use ``dtype=np.float32`` for maximum
        efficiency. Sparse matrices are also supported, use sparse
        ``csc_matrix`` for maximum efficiency.
    Returns
    -------
    self : object
        Returns self.
    """
    # Delegate to fit_transform of the same class, passing X (features),
    # y (labels, optional) and the sample weights supplied by the user.
    self.fit_transform(X, y, sample_weight=sample_weight)
    return self

def fit_transform(self, X, y=None, sample_weight=None):
    """Fit estimator and transform dataset.
    Parameters
    ----------
    X : array-like or sparse matrix, shape=(n_samples, n_features)
        Input data used to build forests. Use ``dtype=np.float32`` for
        maximum efficiency.
    Returns
    -------
    X_transformed : sparse matrix, shape=(n_samples, n_out)
        Transformed dataset.
    """
    # ensure_2d=False because there are actually unit test checking we fail
    # for 1d.
    X = check_array(X, accept_sparse=['csc'], ensure_2d=False)
    
    # Check if the input X is sparse
    if issparse(X):
        
        # If so, pre-sort the indices of X once here, so that each individual
        # tree of the ensemble does not have to sort them again.
        X.sort_indices()
    
    # Turn self.random_state (None, an int seed or a RandomState) into a
    # RandomState instance and store it in the variable rnd.
    rnd = check_random_state(self.random_state)
    
    # Draw one uniformly distributed random target per sample, over the
    # half-open interval [0, 1) (the defaults of uniform: low=0.0, high=1.0).
    y = rnd.uniform(size=X.shape[0])
    
    # Fit the forest on these random targets through the parent class's fit,
    # using super() instead of naming the base class explicitly.
    super(RandomTreesEmbedding, self).fit(X, y, sample_weight=sample_weight)
    
    # Create a OneHotEncoder, which turns a matrix of integers (here the
    # leaf indices) into one-hot columns; the output is sparse or dense
    # depending on self.sparse_output.
    self.one_hot_encoder_ = OneHotEncoder(sparse=self.sparse_output)
    
    # self.apply(X) gives the leaf index of each sample in each tree; fit the
    # encoder on these indices, transform them and return the result.
    return self.one_hot_encoder_.fit_transform(self.apply(X))

def transform(self, X):
    """Transform dataset.
    Parameters
    ----------
    X : array-like or sparse matrix, shape=(n_samples, n_features)
        Input data to be transformed. Use ``dtype=np.float32`` for maximum
        efficiency. Sparse matrices are also supported, use sparse
        ``csr_matrix`` for maximum efficiency.
    Returns
    -------
    X_transformed : sparse matrix, shape=(n_samples, n_out)
        Transformed dataset.
    """
    # Look up the leaf indices of X with self.apply(X), one-hot encode them
    # with the already fitted encoder and return the result.
    return self.one_hot_encoder_.transform(self.apply(X))
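
Since the `super(RandomTreesEmbedding, self)` call above indicates that these methods belong to `RandomTreesEmbedding`, a small usage sketch on that class may help. The synthetic data and parameter values are assumptions. Because every sample falls into exactly one leaf per tree, each row of the one-hot output should contain exactly `n_estimators` ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomTreesEmbedding

# Synthetic features; the labels are not needed because the embedding is
# fitted on random targets, as in fit_transform above.
X, _ = make_classification(n_samples=100, n_features=8, random_state=0)

embedding = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)

# fit_transform: fit the forest, then one-hot encode the leaf indices.
X_transformed = embedding.fit_transform(X)
print("output shape:", X_transformed.shape)

# One active leaf per tree, so every row sums to n_estimators.
row_sums = np.asarray(X_transformed.sum(axis=1)).ravel()
print("ones per row:", np.unique(row_sums))

# transform reuses the fitted forest and encoder.
X_again = embedding.transform(X)
```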

In [4]:
def predict(self, X):
    """Predict class for X.
    The predicted class of an input sample is a vote by the trees in
    the forest, weighted by their probability estimates. That is,
    the predicted class is the one with highest mean probability
    estimate across the trees.
    Parameters
    ----------
    X : array-like or sparse matrix of shape = [n_samples, n_features]
       The input samples. Internally, it will be converted to
       ``dtype=np.float32`` and if a sparse matrix is provided
       to a sparse ``csr_matrix``.
    Returns
    -------
    y : array of shape = [n_samples] or [n_samples, n_outputs]
       The predicted classes.
    """
    # Call predict_proba() of the same class to compute the class
    # probabilities of X and store them in proba.
    proba = self.predict_proba(X)
    
    # Check whether the number of outputs equals 1.
    # n_outputs_ is the number of outputs seen when fit was called.
    if self.n_outputs_ == 1:
        
        # If so, find the index of the highest probability in each row and
        # return the class label stored at that index in classes_.
        return self.classes_.take(np.argmax(proba, axis=1), axis=0)
    else:
        
        # Multi-output case: proba is a list with one probability array per
        # output, so n_samples is the number of rows of the first array.
        n_samples = proba[0].shape[0]
        
        # Define predictions as an array of zeros of shape (n_samples, n_outputs_)
        predictions = np.zeros((n_samples, self.n_outputs_))
        
        # Loop over the range of n_outputs 
        for k in range(self.n_outputs_):
            
            # For output k, pick the most probable class of each sample and
            # store the resulting labels in column k of predictions
            predictions[:, k] = self.classes_[k].take(np.argmax(proba[k],axis=1),axis=0)
        
        # Return the predictions
        return predictions
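
The single-output branch of `predict` can be reproduced by hand, which makes the role of `classes_.take(...)` easier to see. The sketch below assumes synthetic data and uses the three sentiment labels as class names; it is an illustration, not part of the scikit-learn source.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic three-class data with string labels, for illustration only.
X, y_int = make_classification(
    n_samples=200, n_features=10, n_informative=4, n_classes=3, random_state=0
)
names = np.array(["Negative", "Neutral", "Positive"])
y = names[y_int]

clf = ExtraTreesClassifier(n_estimators=25, random_state=0).fit(X, y)

# Mean class probabilities across the trees, one row per sample.
proba = clf.predict_proba(X[:5])

# argmax gives the column index of the most probable class per row;
# classes_.take maps that index back to the class label.
manual = clf.classes_.take(np.argmax(proba, axis=1), axis=0)

print("classes_:", clf.classes_)
print("manual  :", manual)
print("predict :", clf.predict(X[:5]))
```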