# FAKE REVIEW DETECTION


## Seattle University
## Instructor: Dr. Kim
## Student: Huy Le

### Readme
This notebook contains all of the code for my fake review detection research, from loading the data in "yelpreview021302.csv" through to fitting the machine learning models. If your computer cannot execute the whole file because of memory constraints, consider breaking the code into parts such as Data Preprocessing, NLP (Tokenizing, TF-IDF, Cosine Similarity, Sentiment Analysis), Clustering, Classification Models, and Decision Tree Diagram Generation.

## TOC:
* [1. Import Libraries and Dataset](#1st-bullet)
* [2. Summary Statistic](#2nd-bullet)
* [3. Data Preprocessing](#3rd-bullet)
    - [3.1 Textual Data Cleaning](#3_1-bullet)
* [4. Generate Text Features](#4th-bullet)
    - [4.1 Part of Speech](#4_1-bullet)
    - [4.2 Cosine Similarity](#4_2-bullet)
        - [a. Cosine Similarity by N-grams](#4_2_a-bullet)
        - [b. Cosine Similarity by Part of Speech](#4_2_b-bullet)
    - [4.3 Sentiment analysis](#4_3-bullet)
* [5. Clustering](#5th-bullet)
    - [5.1 Clustering with PoS-based cosine similarity](#5_1-bullet)
    - [5.2 Clustering with unigram-based Cosine similarity](#5_2-bullet)
    - [5.3 Integration cluster label with behavioral features](#5_3-bullet)
    
* [6. Building Classification Models](#6th-bullet)
    - [6.1 Features Selection based logistic regression](#6_1-bullet)
    - [6.2 SVM model](#6_2-bullet)
        - [a. PoS similarity](#6_2_a-bullet)
        - [b. Unigram similarity ](#6_2_b-bullet)
    - [6.3 Random Forest Classifiers](#6_3-bullet)
        - [a. PoS similarity](#6_3_a-bullet)
        - [b. Unigram similarity ](#6_3_b-bullet)
    - [6.4 Neural Network](#6_4-bullet)
        - [a. PoS similarity](#6_4_a-bullet)
        - [b. Unigram similarity ](#6_4_b-bullet)
    - [6.5 Decision Tree ](#6_5-bullet)
        - [a. PoS similarity](#6_5_a-bullet)
        - [b. Unigram similarity ](#6_5_b-bullet)
        - [c. Visualizing Decision Tree Diagram](#6_5_c-bullet)



# 1. Import Libraries and Dataset <a class="anchor" id="1st-bullet"></a>

The dataset used in this research was acquired from Dr. Liu at the University of Illinois at Chicago. I did some integration and preprocessing in SQL before arriving at this dataset. For more detail about the dataset, please read my paper.

In [None]:
# import necessary libraries
import sys
import nltk
import sklearn
import pandas as pd
import numpy as np
from sklearn.utils import shuffle
nltk.download('stopwords') 



In [None]:
# load the dataset
raw_data= pd.read_csv('./yelpreview021302.csv')
raw_data.head(5)

In [None]:

# Keep only reviews flagged 'N' or 'Y', i.e. non-filtered and filtered reviews.
raw_data = raw_data.loc[(raw_data['flagged'] !='NR') & (raw_data['flagged']!='YR') ]
raw_data.head(5)

In [None]:
# Create a balanced sample of 10,000 observations (50:50 fake and non-fake reviews):
# 5,000 reviews per class, drawn with replacement (the third argument of np.random.choice).
#df = raw_data.sample(5000).groupby('flagged').head(5000)
fn = lambda obj: obj.loc[np.random.choice(obj.index, 5000, True),:]
df = raw_data.groupby('flagged', as_index=False).apply(fn)
df = shuffle(df)
df.reset_index(drop=True, inplace=True)


In [None]:
df[['flagged','buscateg','pricerange','firstreview']].head(10)

# 2. Summary Statistic <a class="anchor" id="2nd-bullet"></a>
- proportion of fake and non-fake reviews in the dataset
- the correlation between "flagged" and the other features
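
The two checks listed above can be sketched on toy data. This is only an illustration: the frame below is invented for the example (it borrows the column names `flagged`, `reviewcount` and `friendcount` from the dataset), and `is_filtered` is a helper column introduced here, not part of the original data.

```python
import pandas as pd

# Toy data, invented for illustration; column names mirror the notebook's dataset.
toy = pd.DataFrame({
    'flagged': ['Y', 'N', 'Y', 'N', 'Y', 'N'],
    'reviewcount': [1, 40, 2, 35, 3, 50],
    'friendcount': [0, 12, 1, 30, 0, 25],
})

# Encode the label as 0/1 so it can take part in a correlation (helper column).
toy['is_filtered'] = (toy['flagged'] == 'Y').astype(int)

# Proportion of each class in the sample.
proportions = toy['flagged'].value_counts(normalize=True)

# Pearson correlation of every numeric feature with the label.
label_corr = (toy.drop(columns='flagged')
                 .corr()['is_filtered']
                 .drop('is_filtered')
                 .sort_values())

print(proportions)
print(label_corr)
```

On the real sample, the same pattern would apply to `df` once the non-numeric columns have been encoded or dropped.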

In [None]:
# Summarize statistic
print(round(df.describe(),2))
df.isna().sum()

In [None]:
df['pricerange'] = df['pricerange'].fillna(0.0)

In [None]:
# checking data distribution
Y = df['flagged']

print(Y.value_counts())

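# index by label rather than by position, and pass the number of digits to round()
counts = Y.value_counts()
print('% filtered reviews: {}'.format(round(counts['Y'] / len(Y) * 100, 2)))
print('% non-filtered reviews: {}'.format(round(counts['N'] / len(Y) * 100, 2)))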

In [None]:
#Export dataset to CSV file 
#df.to_csv('sampleof10k.csv')

# 3. Data Preprocessing  <a class="anchor" id="3rd-bullet"></a>
## 3.1 Textual Data Cleaning <a class="anchor" id="3_1-bullet"></a>

Steps:
- text feature generation:
    - n-gram
    - Part of Speech
    - TF-IDF
    - Cosine Similarity
- sentiment analysis



In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
# store the review content and the non-text features in separate dataframes
df['flagged'] = df['flagged'].astype('category')
df['buscateg'] = df['buscateg'].astype('category')
df['pricerange'] = df['pricerange'].astype('category')
df['firstreview'] = df['firstreview'].astype('category')


review_content = df['reviewcontent']
behavior_attr = df[['reviewrating',
                    'reusefulcount',
                    'recoolcount',
                    'refunnycount',
                    'friendcount',
                    'fancount',
                    'tipcount',
                    'reviewcount',
                    'firstcount',
                    'usefulcount',
                    'coolcount',
                    'complimentcount',
                    'funnycount',
                    'busrating',
                    'buscateg',
                    'pricerange',
                    'monmembership',
                    'firstreview',
                    'maxReviewDay',
                    'avgReviewDay',
                    'avgpostedrating',
                    'avgreviewlen'
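                    ]].copy()  # copy so the encoded columns below do not write into a view of df

# replace missing review text with an empty string
review_content = review_content.fillna('')



# convert class labels to binary values; LabelEncoder sorts the labels, so 'N' (non-filtered) = 0 and 'Y' (filtered) = 1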
encoder = LabelEncoder()
Y = encoder.fit_transform(df['flagged'])

behavior_attr['buscateg'] =  encoder.fit_transform(behavior_attr['buscateg'])
behavior_attr['pricerange'] = encoder.fit_transform(behavior_attr['pricerange'])
behavior_attr['firstreview'] = encoder.fit_transform(behavior_attr['firstreview'])

print(review_content[:10])

In [None]:
# use regular expressions to replace email addresses, URLs, phone numbers, and other numbers
# (regex=True is required: pandas >= 2.0 treats the pattern as a literal string by default;
#  the patterns are unanchored so that they match anywhere inside a review)

# Replace email addresses with 'emailaddress'
processed = review_content.str.replace(r'[^\s@]+@[^\s@]+\.[a-zA-Z]{2,}',
                                       'emailaddress', regex=True)

# Replace URLs with 'webaddress'
processed = processed.str.replace(r'https?://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?',
                                  'webaddress', regex=True)

# Replace money symbols with 'moneysymb' (€, $, £)
processed = processed.str.replace(r'€|\$|£', 'moneysymb', regex=True)

# Replace 10-digit phone numbers (formats include parentheses, space, no space, dashes) with 'phonenumber'
processed = processed.str.replace(r'\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{4}',
                                  'phonenumber', regex=True)

# Replace remaining numbers with 'number'
processed = processed.str.replace(r'\d+(\.\d+)?', 'number', regex=True)

In [None]:
# Remove punctuation
processed = processed.str.replace(r'[^\w\d\s]', ' ', regex=True)

# replace whitespace between terms with a single space
processed = processed.str.replace(r'\s+', ' ', regex=True)

# remove leading and trailing whitespace
processed = processed.str.replace(r'^\s+|\s+?$', '', regex=True)

In [None]:
# change words to lower case
processed = processed.str.lower()
print(processed)

In [None]:
# remove stopwords from review
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

processed = processed.apply(lambda x: ' '.join(
            term for term in x.split() if term not in stop_words))

In [None]:
print(processed)

In [None]:
# reduce words to their stems using a Porter stemmer
ps = nltk.PorterStemmer()

processed = processed.apply(lambda x: ' '.join(
            ps.stem(term) for term in x.split()))


In [None]:
#processed.to_csv('processed_content.csv', header ='review_content')

# 4. Generate Text Features <a class="anchor" id="4th-bullet"></a>


## 4.1 Part of Speech <a class="anchor" id="4_1-bullet"></a>

In [None]:
# create part-of-speech representations of the review content

from nltk import word_tokenize, pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

#tokenized review into word
#word_tok = processed.apply(word_tokenize)
#print(word_tok.head(10))


In [None]:
# [('I', 'PRP'), ("'m", 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP')]
pos = processed.apply(word_tokenize).apply(pos_tag)

In [None]:
# join each (word, tag) pair as "word_TAG" so that every review becomes one space-separated string
pos = pos.apply(lambda sent: ' '.join(word + '_' + postag for word, postag in sent))

In [None]:
print(pos[:10])


## 4.2 Cosine Similarity <a class="anchor" id="4_2-bullet"></a>
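
As a minimal sketch of what this section computes, the example below builds unigram TF-IDF vectors for three toy reviews (invented for illustration) and compares them with cosine similarity. The names `toy_corpus`, `toy_tfidf` and `toy_similarity` are introduced here only for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three toy "reviews", invented for illustration.
toy_corpus = [
    'great food great service',
    'great food friendly service',
    'parking lot was full',
]

# Unigram TF-IDF vectors, one row per review.
toy_tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 1)).fit_transform(toy_corpus)

# Pairwise cosine similarity: a symmetric n_reviews x n_reviews matrix.
toy_similarity = cosine_similarity(toy_tfidf)
print(toy_similarity.round(2))
```

Because the result has one row and one column per review, its size grows quadratically: for 10,000 reviews a dense float64 matrix holds 10,000 x 10,000 values, about 800 MB, which is where the memory pressure in this notebook comes from.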


In [None]:
# ML Packages For Vectorization of Text For Feature Extraction
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### a. Cosine Similarity by N-grams <a class="anchor" id="4_2_a-bullet"></a>

In [None]:
corpus = processed
# Use unigrams: bigrams and trigrams took too long to generate, although they did show differences between reviews
#cv = CountVectorizer(analyzer='word', ngram_range=(2, 2))
#X_cv = cv.fit_transform(corpus) # Fit the Data

# create tf-idf vectorizer
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1,1)) # unigram tf-idf: ngram_range=(1,1) means only n = 1
#X_tfidf= tfidf.fit_transform(corpus)

In [None]:
# create cosine similarity from the unigram tf-idf vectors
# (the variable name says "bigram", but the vectorizer above uses unigrams)
CS_similarity_bigram =cosine_similarity(tfidf.fit_transform(corpus))

In [None]:
# get the feature shape
CS_similarity_bigram.shape

In [None]:
CS_similarity_bigram

In [None]:
#del CS_similarity_bigram

### b. Cosine Similarity by Part of Speech <a class="anchor" id="4_2_b-bullet"></a>

In [None]:
corpus = pos
# create tf-idf vectors based on POS
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 1))
pos_tfidf= tfidf.fit_transform(corpus)

#print(pos_tfidf.shape)

In [None]:
tfidf.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

In [None]:
import time
start_time = time.time()

# create cosine similarity by tf-idf vector
CS_similarity_pos =cosine_similarity(pos_tfidf)

print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
CS_similarity_pos

In [None]:
# this chunk saves the cosine similarity matrices
# use it if you want to split the notebook into parts to avoid running out of memory

# import pickle
# # #save variable
# with open('CS_similarity_pos', 'wb') as f:
#     pickle.dump(CS_similarity_pos, f)

# with open('CS_similarity_bigram', 'wb') as f:
#     pickle.dump(CS_similarity_bigram, f)
   
#load variable
#with open(filename, 'rb') as f:
    #var_you_want_to_load_into = pickle.load(f)

In [None]:
# cs_matrix = CS_similarity_pos

# for i in range(0, cs_matrix.shape[0]):
#     cs_matrix[i,i] =0

In [None]:
#cs_matrix

## 4.3 Sentiment analysis <a class="anchor" id="4_3-bullet"></a>


In [None]:
from textblob import TextBlob 
from textblob.sentiments import NaiveBayesAnalyzer

#Example how to get sentiment score of a text
#rv_sentiment = TextBlob("Came in on the early afternoon on Sunday. The food was tasty and priced well.",analyzer=NaiveBayesAnalyzer()).sentiment


### **Note**
NaiveBayesAnalyzer() takes too long to analyze the sentiment of the whole dataset: when I tested it, it ran for more than 4 hours. I therefore moved forward with TextBlob's default pattern analyzer.

In [None]:
# Polarity and subjectivity
pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

# generate polarity and subjectivity then add them to behavior_attr dataframe
behavior_attr['polarity'] = processed.apply(pol)
behavior_attr['subjective'] = processed.apply(sub)

#reve_pos = []
#reve_pos += [TextBlob(x, analyzer= NaiveBayesAnalyzer()).sentiment.p_pos  for x in processed]

In [None]:
#behavior_attr.to_csv('behavior_attr.csv')

In [None]:
# Let's plot the results
# import matplotlib.pyplot as plt

# plt.rcParams['figure.figsize'] = [10, 8]

# x = polarity
# y = subjective
# plt.scatter(x, y, color='blue')
# #plt.text(x+.001, y+.001, data['full_name'][index], fontsize=10)
# plt.xlim(-.01, .12) 
    
# plt.title('Sentiment Analysis', fontsize=20)
# plt.xlabel('<-- Negative -------- Positive -->', fontsize=15)
# plt.ylabel('<-- Facts -------- Opinions -->', fontsize=15)

# plt.show()

### Behavior feature
- behavior_attr: behavioral attributes
### Text_feature
- processed: text content
- polarity: polarity of review
- subjective: subjectivity of review
- CS_similarity_pos: cosine similarity generated from POS
- CS_similarity_bigram: cosine similarity generated from unigram TF-IDF (despite the variable name)

# 5. Clustering <a class="anchor" id="5th-bullet"></a>

## 5.1 Clustering with PoS-based cosine similarity <a class="anchor" id="5_1-bullet"></a>
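
The idea in this section is to treat each row of the similarity matrix as the feature vector of one review and let a Gaussian mixture group the rows. The sketch below illustrates that on a toy block-structured matrix invented for the example. It uses `covariance_type='diag'` as an assumption to keep such a tiny example numerically stable, whereas the notebook keeps the scikit-learn default.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy block-structured "similarity" matrix, invented for illustration:
# rows 0-9 resemble each other, rows 10-19 resemble each other.
rng = np.random.default_rng(0)
toy_sim = np.zeros((20, 20))
toy_sim[:10, :10] = 0.9
toy_sim[10:, 10:] = 0.9
toy_sim += rng.normal(scale=0.01, size=toy_sim.shape)  # small noise

# Each row is treated as that review's feature vector.
# covariance_type='diag' is an assumption made for this small example only.
toy_gmm = GaussianMixture(n_components=2, covariance_type='diag',
                          random_state=138).fit(toy_sim)
toy_labels = toy_gmm.predict(toy_sim)
print(toy_labels)
```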

In [None]:
from sklearn.mixture import GaussianMixture


def em_clustering(X, k ):
    gmm = GaussianMixture(n_components=k, random_state= 138)
    gmm.fit(X)
    return gmm



In [None]:
# em Clustering
start_time = time.time()
k= 5 
gmm = em_clustering(CS_similarity_pos,k)


print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
import gc
gc.collect()

In [None]:
cluster_pos = gmm.predict(CS_similarity_pos)

cluster_pos[:10]

In [None]:
#plot clustering result
dt_plt = pd.DataFrame({'Cluster':cluster_pos})
dt_plt = dt_plt.groupby('Cluster').size().reset_index(name='count')  # one row per cluster with its size

In [None]:
dt_plt

In [None]:
#visualizing
import itertools
from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl


In [None]:
color_iter = ['navy', 'turquoise', 'cornflowerblue','darkorange']
plt.figure(figsize=(6,6))
plt.bar(dt_plt['Cluster'], dt_plt['count'],  color=color_iter )
plt.xlabel('Clusters')
plt.xticks(dt_plt['Cluster'])
plt.ylabel('Count')
plt.title('Review Clustering by POS Cosine Similarity')

for x,y in zip(dt_plt['Cluster'], dt_plt['count']):
    plt.annotate('{}'.format(y ),
                 xy=(x , y + 10),
                 xytext=(0, 3),  # 3 points vertical offset
                 textcoords="offset points",
                 ha='center', va='bottom')
plt.show()

In [None]:
# This plot visualizes the clusters using only the first two columns of the cosine similarity matrix as x and y

plt.scatter(CS_similarity_pos[:, 0], CS_similarity_pos[:, 1], c=cluster_pos, s=40, cmap='viridis', alpha =0.5)
plt.show()

In [None]:
# def draw_ellipse(position, covariance, ax=None, **kwargs):
#     """Draw an ellipse with a given position and covariance"""
#     ax = ax or plt.gca()
    
#     # Convert covariance to principal axes
#     if covariance.shape == (2, 2):
#         U, s, Vt = np.linalg.svd(covariance)
#         angle = np.degrees(np.arctan2(U[1, 0], U[0, 0]))
#         width, height = 2 * np.sqrt(s)
#     else:
#         angle = 0
#         width, height = 2 * np.sqrt(covariance)
    
#     # Draw the Ellipse
#     for nsig in range(1, 4):
#         ax.add_patch(Ellipse(position, nsig * width, nsig * height,
#                              angle, **kwargs))
        
# def plot_gmm(gmm, lables_, X, label=True, ax=None):
#     ax = ax or plt.gca()
#     labels = lables_
#     if label:
#         ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
#     else:
#         ax.scatter(X[:, 0], X[:, 1], s=40, zorder=2)
#     ax.axis('equal')
    
#     w_factor = 0.2 / gmm.weights_.max()
#     for pos, covar, w in zip(gmm.means_, gmm.covariances_, gmm.weights_):
#         draw_ellipse(pos, covar, alpha=w * w_factor)

In [None]:
#plot_gmm(gmm,cluster_pos, CS_similarity_pos)

## 5.2 Clustering with unigram-based Cosine similarity  <a class="anchor" id="5_2-bullet"></a>

In [None]:
from sklearn.mixture import GaussianMixture


# em Clustering
start_time = time.time()
gmm = em_clustering(CS_similarity_bigram,k)
print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
cluster_ngram = gmm.predict(CS_similarity_bigram)

cluster_ngram[:10]

In [None]:
#plot clustering result
dt_plt = pd.DataFrame({'Cluster':cluster_ngram})
dt_plt = dt_plt.groupby('Cluster')['Cluster'].count()

In [None]:
dt_plt

In [None]:
#visualizing clusters

color_iter = ['navy', 'turquoise', 'cornflowerblue','darkorange']

dt_plt.plot(kind ='bar', color = color_iter)
for x,y in zip(dt_plt.index, dt_plt):
    plt.annotate('{}'.format(y ),
                 xy=(x , y + 10),
                 xytext=(0, 3),  # 3 points vertical offset
                 textcoords="offset points",
                 ha='center', va='bottom')
plt.title('Clusters by N-gram Cosine Similarity')
plt.show()

## 5.3 Integration cluster label with behavioral features   <a class="anchor" id="5_3-bullet"></a>
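
The cells below attach the cluster labels by reindexing, which lines up with the rows because `df` had its index reset to 0..n-1 earlier. As a hedged alternative sketch, the labels can be attached by position regardless of the index values. The frame and label array below are invented for illustration, and `toy_labeled` is a name introduced only here.

```python
import numpy as np
import pandas as pd

# Toy behavioural features with a non-default index, invented for illustration.
toy_behavior = pd.DataFrame(
    {'reviewrating': [5, 1, 4], 'friendcount': [0, 12, 3]},
    index=[10, 11, 12],
)
toy_clusters = np.array([2, 0, 2])  # one label per row, in row order

# Wrapping the labels in a Series that reuses the frame's index attaches them
# by position, whatever the index values are.
toy_labeled = toy_behavior.assign(
    cluster=pd.Series(toy_clusters, index=toy_behavior.index)
)
print(toy_labeled)
```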

In [None]:
df_pos=pd.concat([behavior_attr,
                 pd.DataFrame({'cluster':cluster_pos}).reindex(behavior_attr.index)],
#                  pd.DataFrame({'polarity':polarity}).reindex(behavior_attr.index),
#                  pd.DataFrame({'subjective':subjective}).reindex(behavior_attr.index)],
                 #pd.DataFrame(CS_similarity_bigram).reindex(behavior_attr.index)], 
                 axis=1) 
df_ngram=pd.concat([behavior_attr,
                 pd.DataFrame({'cluster':cluster_ngram}).reindex(behavior_attr.index)],
#                  pd.DataFrame({'polarity':polarity}).reindex(behavior_attr.index),
#                  pd.DataFrame({'subjective':subjective}).reindex(behavior_attr.index)],
#                  #pd.DataFrame(CS_similarity_bigram).reindex(behavior_attr.index)], 
                 axis=1) 

In [None]:
# you can export the data for later steps, so you don't have to run the entire notebook every time.

#df_pos.to_csv('df_pos.csv')
#df_ngram.to_csv('df_ngram.csv')

In [None]:
df_pos.head(10)

# 6. Building Classification Models <a class="anchor" id="6th-bullet"></a>

## 6.1 Features Selection based logistic regression <a class="anchor" id="6_1-bullet"></a>

In [None]:
import statsmodels.api as sm

X = df_pos
y = Y
def stepwise_selection(X, y, 
                       initial_list=[], 
                       threshold_in=0.01, 
                       threshold_out = 0.05, 
                       verbose=True):
    """ Perform a forward-backward feature selection 
    based on p-value from statsmodels.api.OLS
    Arguments:
        X - pandas.DataFrame with candidate features
        y - list-like with the target
        initial_list - list of features to start with (column names of X)
        threshold_in - include a feature if its p-value < threshold_in
        threshold_out - exclude a feature if its p-value > threshold_out
        verbose - whether to print the sequence of inclusions and exclusions
    Returns: list of selected features 
    Always set threshold_in < threshold_out to avoid infinite looping.
    See https://en.wikipedia.org/wiki/Stepwise_regression for the details
    """
    included = list(initial_list)
    while True:
        changed=False
        # forward step
        excluded = list(set(X.columns)-set(included))
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included+[new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            best_feature = new_pval.idxmin()  # idxmin returns the column label; argmin returns a position
            included.append(best_feature)
            changed=True
            if verbose:
                print('Add  {:30} with p-value {:.6}'.format(best_feature, best_pval))

        # backward step
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        # use all coefs except intercept
        pvalues = model.pvalues.iloc[1:]
        worst_pval = pvalues.max() # null if pvalues is empty
        if worst_pval > threshold_out:
            changed=True
            worst_feature = pvalues.idxmax()  # idxmax returns the column label; argmax returns a position
            included.remove(worst_feature)
            if verbose:
                print('Drop {:30} with p-value {:.6}'.format(worst_feature, worst_pval))
        if not changed:
            break
    return included

result = stepwise_selection(X, y)

print('resulting features:')
print(result)



In [None]:
df_filtered = df_pos[['monmembership', 'maxReviewDay', 'reusefulcount', 'avgReviewDay', 'complimentcount', 'avgreviewlen', 'reviewrating', 'avgpostedrating', 'recoolcount', 'tipcount', 'buscateg', 'polarity', 'fancount', 'firstreview', 'subjective', 'cluster']]

## 6.2 SVM model <a class="anchor" id="6_2-bullet"></a>
### a. PoS similarity <a class="anchor" id="6_2_a-bullet"></a>
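
When features are standardized before an SVM, the scaler should learn its mean and variance from the training split only. One common way to guarantee this is a scikit-learn `Pipeline`. The sketch below uses synthetic data from `make_classification` as a stand-in for the feature table. The names `X_toy`, `svm_pipe` and `toy_accuracy` are introduced only for this example.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the feature table, generated for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=138)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.20, random_state=138)

# The pipeline fits the scaler on the training split only, then fits the SVM.
svm_pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
svm_pipe.fit(X_tr, y_tr)

# At prediction time the test split is transformed with the training statistics.
y_hat = svm_pipe.predict(X_te)
toy_accuracy = metrics.accuracy_score(y_te, y_hat)
print(metrics.confusion_matrix(y_te, y_hat))
print('Accuracy:', toy_accuracy)
```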

In [None]:
# entire dataset
# test building a classification model with the cosine similarity features
# we can split the feature set into training and testing datasets using sklearn
from sklearn import model_selection
from sklearn.model_selection  import train_test_split
from sklearn.preprocessing import StandardScaler
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC
from sklearn import metrics


In [None]:

def fit_SVM_(X,Y, kernel ='rbf'):
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20,random_state=138) 
    # standardize the independent variables; fit the scaler on the training split only
    # so that no information from the test split leaks into training
    scaler = StandardScaler()  
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)  
    X_test = scaler.transform(X_test)
    # SVM kernel: 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'
    svclassifier = SVC(kernel=kernel)    	## use the kernel passed by the caller
    svclassifier.fit(X_train, y_train)  
    y_pred = svclassifier.predict(X_test)  	## predict test set
    print(metrics.confusion_matrix(y_test, y_pred))
    print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
    print("Error rate:", 1-metrics.accuracy_score(y_test, y_pred))
    print("Recall:", metrics.recall_score(y_test, y_pred))
    print("Precision:", metrics.precision_score(y_test, y_pred))
    print('{} {} {} {}'.format(metrics.accuracy_score(y_test, y_pred),
                   1-metrics.accuracy_score(y_test, y_pred),
                   metrics.recall_score(y_test, y_pred),
                   metrics.precision_score(y_test, y_pred)))


In [None]:
import time
start_time = time.time()

fit_SVM_(df_pos,Y, kernel ='rbf')

print("--- %s seconds ---" % (time.time() - start_time))


In [None]:
for i in range(0,k):
    df_cluster0 = df_pos[df_pos['cluster'] == i]
    Y = pd.Series(Y)
    print('\n Classifying in cluster: {}'.format(i))
    fit_SVM_(df_cluster0.drop('cluster',axis=1),Y.iloc[df_cluster0.index], kernel ='rbf')

In [None]:
# without cluster
fit_SVM_(df_pos.drop('cluster',axis=1),Y, kernel ='rbf' )


### b. Unigram similarity <a class="anchor" id="6_2_b-bullet"></a>

In [None]:
import time
start_time = time.time()

fit_SVM_(df_ngram,Y, kernel ='rbf')

print("--- %s seconds ---" % (time.time() - start_time))


In [None]:
for i in range(0,k):
    df_cluster0 = df_ngram[df_ngram['cluster'] == i]
    Y = pd.Series(Y)
    print('Classifying in cluster: {}'.format(i))
    fit_SVM_(df_cluster0.drop('cluster',axis=1),Y.iloc[df_cluster0.index], kernel ='rbf')

## 6.3 Random Forest Classifiers <a class="anchor" id="6_3-bullet"></a>

### a. PoS similarity <a class="anchor" id="6_3_a-bullet"></a>

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel

In [None]:

def RF_selection(X,Y):
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20,random_state=138) 
    sel = SelectFromModel(RandomForestClassifier(n_estimators = 500,max_depth = 100))
    sel.fit(X_train, y_train)
    # keep the columns whose importance is above the selector's threshold
    selected_feat= X_train.columns[(sel.get_support())]
    print(len(selected_feat))
    print(selected_feat)


In [None]:
RF_selection(df_pos,Y)

In [None]:
def RF_classifier(X,Y, 
                  max_features_= 'sqrt',  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
                  n_estimators_ = 100,
                  max_depth_ = None,
                  min_sample_leaf_ = 2):
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20,random_state=138) 
    clf = RandomForestClassifier(max_features = max_features_, 
                                 n_estimators = n_estimators_,
                                 max_depth = max_depth_,
                                 min_samples_leaf = min_sample_leaf_)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)  	## predict test set
    print(metrics.confusion_matrix(y_test, y_pred))
    print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
    print("Error rate:", 1-metrics.accuracy_score(y_test, y_pred))
    print("Recall:", metrics.recall_score(y_test, y_pred))
    print("Precision:", metrics.precision_score(y_test, y_pred))
    print('{} {} {} {}'.format(metrics.accuracy_score(y_test, y_pred),
                   1-metrics.accuracy_score(y_test, y_pred),
                   metrics.recall_score(y_test, y_pred),
                   metrics.precision_score(y_test, y_pred)))
    return clf

In [None]:
clf = RF_classifier(df_pos,Y,n_estimators_=500,min_sample_leaf_= 2, max_depth_ = 100)

In [None]:
fi = pd.DataFrame({'feature': list(df_pos.columns),
                  'importance': clf.feature_importances_}).\
                   sort_values('importance', ascending = False)
print(fi.head(10))

dfplot = fi.head(10).sort_values('importance', ascending = True)

plt.scatter(dfplot['feature'], dfplot['importance'], s =50, c='tomato', alpha=1, marker ='^')
plt.xticks(rotation=90)
plt.xlabel('Feature Name')
plt.ylabel('Importance')
plt.title('Top 10 Most Important Features in RF Model')
# label each point with its importance value
for x,y in zip(range(len(dfplot)), dfplot['importance']):
    plt.annotate('{}'.format(round(y,3)),
                 xy=(x , y + 0.01),
                 xytext=(0, 0),  # no additional offset
                 textcoords="offset points",
                 ha='center', va='bottom')

plt.show()

In [None]:
for i in range(0,k):
    df_cluster0 = df_pos[df_pos['cluster'] == i]
    Y = pd.Series(Y)
    print('\nClassifying in cluster: {}'.format(i))
    RF_classifier(df_cluster0.drop('cluster',axis=1),Y.iloc[df_cluster0.index],n_estimators_=500,min_sample_leaf_= 2, max_depth_ = 100)

In [None]:
# without cluster
RF_classifier(df_pos.drop('cluster',axis=1),Y,n_estimators_=150,min_sample_leaf_= 2, max_depth_ = 50 )

In [None]:
# without text-generated features
RF_classifier(df_pos.drop(['polarity','subjective'],axis=1),Y,n_estimators_=150,min_sample_leaf_= 2, max_depth_ = 50 )

### b. Unigram similarity <a class="anchor" id="6_3_b-bullet"></a>

In [None]:
RF_classifier(df_ngram,Y,n_estimators_=150,min_sample_leaf_= 2, max_depth_ = 100 )

In [None]:
for i in range(0,k):
    df_cluster0 = df_ngram[df_ngram['cluster'] == i]
    Y = pd.Series(Y)
    print('\nClassifying in cluster: {}'.format(i))
    RF_classifier(df_cluster0.drop('cluster',axis=1),Y.iloc[df_cluster0.index],n_estimators_=150,min_sample_leaf_= 2, max_depth_ = 100)

In [None]:
# without cluster
RF_classifier(df_ngram.drop('cluster',axis=1),Y,n_estimators_=150,min_sample_leaf_= 2, max_depth_ = 100 )

In [None]:
# without cluster and text-generated features
RF_classifier(df_ngram.drop(['cluster','polarity','subjective'],axis=1),Y,n_estimators_=150,min_sample_leaf_= 2, max_depth_ = 100 )

## 6.4 Neural Network <a class="anchor" id="6_4-bullet"></a>

### a. PoS similarity <a class="anchor" id="6_4_a-bullet"></a>

In [None]:
from sklearn.neural_network import MLPClassifier

In [None]:
def NN_classifier(X, Y, hidden_layers_ = (5,2), activation_ ='relu',solver_ ='adam' , alphafloat_ =1e-5):

    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20,random_state=138)

    clf = MLPClassifier(solver=solver_,
                        alpha=alphafloat_,
                        activation =activation_,
                        hidden_layer_sizes= hidden_layers_,
                        random_state=1)

    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    cm =  metrics.confusion_matrix(y_test, y_pred)
    print(cm)
    print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
    print("Error rate:", 1-metrics.accuracy_score(y_test, y_pred))
    print("Recall:", metrics.recall_score(y_test, y_pred))
    print("Precision:", metrics.precision_score(y_test, y_pred))
    print('{} {} {} {}'.format(metrics.accuracy_score(y_test, y_pred),
                   1-metrics.accuracy_score(y_test, y_pred),
                   metrics.recall_score(y_test, y_pred),
                   metrics.precision_score(y_test, y_pred)))
#     plt.clf()
#     plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Wistia)
#     classNames = ['0','1']
#     plt.title('NN Confusion Matrix - Test Data')
#     plt.ylabel('True Class')
#     plt.xlabel('Predicted Class')
#     tick_marks = np.arange(len(classNames))
#     plt.xticks(tick_marks, classNames, rotation=45)
#     plt.yticks(tick_marks, classNames)
#     s = [['TN','FP'], ['FN', 'TP']]
#     for i in range(2):
#         for j in range(2):
#             plt.text(j,i, str(s[i][j])+" = "+str(cm[i][j]))
#     plt.show()

In [None]:
# full clusters

NN_classifier(df_pos,Y,hidden_layers_ =(20,30,40,50), solver_='adam', activation_= 'tanh' )

In [None]:
for i in range(0,k):
    df_cluster0 = df_pos[df_pos['cluster'] == i]
    Y = pd.Series(Y)
    print('\nClassifying in cluster: {}'.format(i))
    NN_classifier(df_cluster0.drop('cluster',axis=1),Y.iloc[df_cluster0.index],hidden_layers_ =(20,30,40,50), solver_='adam', activation_= 'tanh' )

In [None]:
#without cluster
NN_classifier(df_pos.drop('cluster',axis=1),Y,hidden_layers_ =(20,30,40,50), solver_='adam', activation_= 'tanh' )

### b. Unigram similarity <a class="anchor" id="6_4_b-bullet"></a>

In [None]:
NN_classifier(df_ngram,Y,hidden_layers_ =(20,30,40,50), solver_='adam', activation_= 'tanh' )

In [None]:
for i in range(0,k):
    df_cluster0 = df_ngram[df_ngram['cluster'] == i]
    Y = pd.Series(Y)
    print('Classifying in cluster: {}'.format(i))
    NN_classifier(df_cluster0.drop('cluster',axis=1),Y.iloc[df_cluster0.index],hidden_layers_ =(20,30,40,50), solver_='adam', activation_= 'tanh' )

In [None]:
# without cluster
NN_classifier(df_ngram.drop('cluster',axis=1),Y,hidden_layers_ =(20,30,40,50), solver_='adam', activation_= 'tanh')

In [None]:
# without cluster and text-generated features
NN_classifier(df_ngram.drop(['cluster','polarity','subjective'],axis=1),Y,hidden_layers_ =(20,30,40,50), solver_='adam', activation_= 'tanh')

## 6.5 Decision Tree <a class="anchor" id="6_5-bullet"></a>

### a. PoS similarity <a class="anchor" id="6_5_a-bullet"></a>


In [None]:
from sklearn import tree

In [None]:
def DT_clf(X, Y, max_features_= 'sqrt',  # 'auto' meant 'sqrt' here and was removed in scikit-learn 1.3
                  max_depth_ = None,
                  min_sample_leaf_ = 2):
    from sklearn import tree
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20,random_state=138)

    clf = tree.DecisionTreeClassifier(max_features = max_features_, 
                                     max_depth = max_depth_,
                                     min_samples_leaf = min_sample_leaf_)

    clf.fit(X_train, y_train)
    
    #plot the tree
    #tree.plot_tree(clf.fit(X_train, y_train)) 
    
    y_pred = clf.predict(X_test)
    cm =  metrics.confusion_matrix(y_test, y_pred)
    print(cm)
    print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
    print("Error rate:", 1-metrics.accuracy_score(y_test, y_pred))
    print("Recall:", metrics.recall_score(y_test, y_pred))
    print("Precision:", metrics.precision_score(y_test, y_pred))
    plt.clf()
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Wistia)
    classNames = ['0','1']
    plt.title('Decision Tree Confusion Matrix - Test Data')
    plt.ylabel('True Class')
    plt.xlabel('Predicted Class')
    tick_marks = np.arange(len(classNames))
    plt.xticks(tick_marks, classNames, rotation=45)
    plt.yticks(tick_marks, classNames)
    s = [['TN','FP'], ['FN', 'TP']]
    for i in range(2):
        for j in range(2):
            plt.text(j,i, str(s[i][j])+" = "+str(cm[i][j]))
    plt.show()
    return clf
    

In [None]:
dtmodel = DT_clf(df_pos,Y,min_sample_leaf_= 2, max_depth_ = 15)

In [None]:
for i in range(0,k):
    df_cluster0 = df_pos[df_pos['cluster'] == i]
    Y = pd.Series(Y)
    print('Classifying in cluster: {}'.format(i))
    dt_k = DT_clf(df_cluster0.drop('cluster', axis =1),Y.iloc[df_cluster0.index],min_sample_leaf_= 2, max_depth_ = 15)

In [None]:
# withou cluster
dtmodel = DT_clf(df_pos.drop(['cluster'],axis=1),Y,min_sample_leaf_= 2, max_depth_ = 15)

### b. Unigram similarity <a class="anchor" id="6_5_b-bullet"></a>

In [None]:
dtmodel = DT_clf(df_ngram,Y,min_sample_leaf_= 2, max_depth_ = 15)

In [None]:
for i in range(0,k):
    df_cluster0 = df_ngram[df_ngram['cluster'] == i]
    Y = pd.Series(Y)
    print('Classifying in cluster: {}'.format(i))
    dt_k = DT_clf(df_cluster0.drop('cluster', axis =1),Y[df_cluster0.index],min_sample_leaf_= 2, max_depth_ = 15)

In [None]:
# without cluster
dtmodel = DT_clf(df_ngram.drop(['cluster'],axis=1),Y,min_sample_leaf_= 2, max_depth_ = 15)

### c. Visualizing Decision Tree Diagram <a class="anchor" id="6_5_c-bullet"></a>
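
For reference, `export_graphviz` can also hand back the DOT source as a string when `out_file=None`, which avoids the round trip through a file on disk. The sketch below shows this on synthetic data. The feature names `f0` to `f3` and the variable names are hypothetical and used only for illustration.

```python
from sklearn import tree
from sklearn.datasets import make_classification

# Synthetic data and hypothetical feature names, for illustration only.
X_toy, y_toy = make_classification(n_samples=100, n_features=4, random_state=138)
toy_names = ['f0', 'f1', 'f2', 'f3']

toy_tree = tree.DecisionTreeClassifier(max_depth=3, random_state=138).fit(X_toy, y_toy)

# With out_file=None the DOT source is returned as a string instead of being written to disk.
toy_dot = tree.export_graphviz(toy_tree, out_file=None, feature_names=toy_names)
print(toy_dot[:200])
```

The returned string can be passed directly to `graphviz.Source`, provided the Graphviz binaries are installed.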

In [None]:
# recreate the decision tree model for visualization
# choosing a maximum depth of 10
dtmodel = DT_clf(df_ngram, Y, min_sample_leaf_= 2, max_depth_ = 10)

In [None]:
import graphviz 
from IPython.display import SVG

# out_file is a path here, so export_graphviz writes the DOT file to disk and returns None
tree.export_graphviz(dtmodel, # decision tree model
                     out_file="ngram_8cluster.dot", # name of dot file 
                     feature_names=list(df_ngram.columns)) # feature names displayed in the tree



In [None]:
import graphviz 
from IPython.display import SVG
from sklearn import tree
import os
# add the Graphviz binaries to PATH (Windows install location; adjust to your own installation)
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'

In [None]:
with open("ngram_8cluster.dot") as f: # read back the DOT file written by export_graphviz above
    dot_data = f.read()

graph = graphviz.Source(dot_data)
graph.format = 'png' # file type
graph.render("ngram_DT_8clusterimage",view=True)