# Using the Shark Tank (US) pitches and deals dataset to train machine learning algorithms

After I published the Shark Tank pitches and deals dataset, a couple of my colleagues asked me how exactly this dataset can be used to predict a deal on Shark Tank.

I am writing this kernel as an example for them, and for others, of how the dataset can be used. It can also serve as basic NLP and text vectorization starter code for new machine learning students.

Many aspects of this code reflect my personal preference and can be changed: for example, I have used count vectorization from sklearn with selected n-grams. Alternatively, one may use TF-IDF vectorization.
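As a rough illustration of that alternative, the sketch below swaps `CountVectorizer` for scikit-learn's `TfidfVectorizer` on a tiny made-up corpus (the three sentences are placeholders, not rows from the dataset). Both vectorizers accept the same `ngram_range` parameter, so the rest of the kernel should not need to change.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus standing in for df['Pitched_Business_Desc']
toy_corpus = [
    "reusable water bottle with filter",
    "water bottle holder for bikes",
    "subscription box for dog treats",
]

count_vec = CountVectorizer(ngram_range=(2, 2))  # raw bigram counts
tfidf_vec = TfidfVectorizer(ngram_range=(2, 2))  # bigram counts reweighted by inverse document frequency

count_features = count_vec.fit_transform(toy_corpus)
tfidf_features = tfidf_vec.fit_transform(toy_corpus)

# Same vocabulary and matrix shape; only the cell values differ
print(count_features.shape, tfidf_features.shape)
print(sorted(count_vec.vocabulary_) == sorted(tfidf_vec.vocabulary_))
```

Whether TF-IDF weighting actually helps on this dataset is an open question that would need to be checked empirically.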

## Step 1 - Importing all libraries and data

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re
import nltk
df = pd.read_csv('../input/Sharktankpitchesdeals.csv')
## Check whether dataset is loaded
df.head()

## Step 2 - Cleaning Data

In [None]:
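def data_cleaning(corpus):
    # Replace every non-letter character (digits, punctuation) with a space
    letters_only = re.sub("[^a-zA-Z]", " ", corpus)
    # Lower-case and split on whitespace, which also collapses repeated spaces
    words = letters_only.lower().split()
    return " ".join(words)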
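To make the effect of the cleaning step concrete, here is a small usage sketch. The function is repeated so the snippet runs on its own, and the sample string is a made-up description rather than a row from the dataset. Note that digits are dropped along with punctuation.

```python
import re

def data_cleaning(corpus):
    # Same logic as the cell above, repeated so this snippet is self-contained
    letters_only = re.sub("[^a-zA-Z]", " ", corpus)
    words = letters_only.lower().split()
    return " ".join(words)

# Hypothetical pitch description, not taken from the dataset
sample = "Eco-friendly bottles, 100% BPA-free!"
cleaned = data_cleaning(sample)
print(cleaned)  # eco friendly bottles bpa free
```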

### Tips - ready-made code for removing stopwords or other specific words from the dataset

You may want to remove stopwords at this point (the NLTK stopword corpus must be downloaded once with `nltk.download('stopwords')`):
> from nltk.corpus import stopwords

You may also want to remove additional words that you have observed in the dataset and that are unlikely to contribute to the learning:

> addedwords = ('service','use','product','line','allow','make','offer','provide','products','design','made')

> stop = stopwords.words('english') + list(addedwords)

> df['Pitched_Business_Desc'] = df['Pitched_Business_Desc'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
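If the NLTK stopword corpus is not available in your environment, a similar effect can be sketched with scikit-learn's built-in English stop word list, passed directly to the vectorizer. This is an alternative under stated assumptions rather than the approach used in this kernel: the toy sentences are made up, and `extra_words` is a shortened, illustrative version of `addedwords`.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Illustrative subset of the extra words suggested above
extra_words = {"service", "use", "product", "products", "make", "made"}
stop = list(ENGLISH_STOP_WORDS.union(extra_words))

# Hypothetical descriptions, already lower-cased as data_cleaning would leave them
toy_corpus = [
    "a product made for the busy parents",
    "the service to make healthy snacks",
]

# The vectorizer drops stop words before counting, so no separate apply() step is needed
vec = CountVectorizer(stop_words=stop)
vec.fit(toy_corpus)
print(sorted(vec.vocabulary_))
```

The two stop word lists are not identical, so results may differ slightly from the NLTK version.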

In [None]:
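# Clean every description, then keep only the label and the text
df['Pitched_Business_Desc'] = df['Pitched_Business_Desc'].apply(data_cleaning)
df = df[['Deal_Status','Pitched_Business_Desc']]
# Print the first five cleaned descriptions as a sanity check
for i in range(5):
    print(df['Pitched_Business_Desc'][i])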

## Step 3 - Vectorize the data & split into training and testing datasets

In [None]:
## Split into train/test sets
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
train, test = train_test_split(df, test_size=0.2)
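The split above is shuffled differently on every run, so accuracy figures will vary between executions. A possible refinement, sketched here on a made-up 20-row frame, is to pass `random_state` for reproducibility and `stratify` to preserve the class balance. Both are standard `train_test_split` parameters; the toy data and the 0/1 encoding of `Deal_Status` are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df: 20 pitches, half with a deal (1) and half without (0)
toy_df = pd.DataFrame({
    "Deal_Status": [1] * 10 + [0] * 10,
    "Pitched_Business_Desc": [f"pitch number {i}" for i in range(20)],
})

# random_state fixes the shuffle; stratify keeps the deal/no-deal ratio in both parts
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=10, stratify=toy_df["Deal_Status"]
)
print(toy_train["Deal_Status"].value_counts().to_dict())
print(toy_test["Deal_Status"].value_counts().to_dict())
```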

In [None]:
## Vectorize
train_corpus = train['Pitched_Business_Desc'].tolist()
test_corpus = test['Pitched_Business_Desc'].tolist()
## Fit the bigram vocabulary on the training text only, then reuse it for the test text
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer(ngram_range=(2,2))
train_features = v.fit_transform(train_corpus)
test_features = v.transform(test_corpus)
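To show what `ngram_range=(2,2)` produces, and why the vectorizer is fitted on the training text only, here is a minimal sketch on invented sentences. The variable names prefixed with `toy_` are illustrative and not part of the kernel.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned descriptions
toy_train = ["smart water bottle", "water bottle holder"]
toy_test = ["smart bottle holder"]

toy_v = CountVectorizer(ngram_range=(2, 2))          # bigrams only, no single words
toy_train_features = toy_v.fit_transform(toy_train)  # learns the vocabulary from the training text
toy_test_features = toy_v.transform(toy_test)        # reuses that vocabulary; unseen bigrams are ignored

print(sorted(toy_v.vocabulary_))        # ['bottle holder', 'smart water', 'water bottle']
print(toy_test_features.toarray())      # [[1 0 0]]: only 'bottle holder' is known
```

The bigram "smart bottle" never appeared in training, so it contributes nothing to the test row. With a small dataset, many test bigrams may be unseen in this way.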

In [None]:
print(train_features.shape)
print(test_features.shape)

## Step 4 - Initiate machine learning algorithms

In [None]:
# Import ML models from sklearn
from sklearn.linear_model import LogisticRegression # Logistic Regression classifier
from sklearn.tree import DecisionTreeClassifier # Decision Tree classifier
from sklearn import svm # Support Vector Machine
from sklearn.linear_model import SGDClassifier # Stochastic Gradient Descent Classifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier # Random Forest and Gradient Boosting Classifier
from sklearn.naive_bayes import MultinomialNB # Naive Bayes Classifier 
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix # Some metrics to check the performance of the models

In [None]:
# Setting parameters for each algorithm - these are tunable to achieve max accuracy

Classifiers = {'LR':LogisticRegression(random_state=10,C=5,max_iter=200),
               'DTC':DecisionTreeClassifier(random_state=10,min_samples_leaf=2),
               'RF':RandomForestClassifier(random_state=10,n_estimators=100,n_jobs=-1),
               'GBC':GradientBoostingClassifier(random_state=10,n_estimators=400,learning_rate=0.2),
               'SGD':SGDClassifier(loss="hinge", penalty="l2"),
               'SVM':svm.SVC(kernel='linear', C=0.1),
               'NB':MultinomialNB(alpha=.05)}


In [None]:
# Helper that fits one classifier and reports its test accuracy, so the same code can be reused for every model
def ML_Pipeline(clf_name):
    clf = Classifiers[clf_name]
    clf.fit(train_features,train['Deal_Status'])
    pred = clf.predict(test_features)
    Accuracy = accuracy_score(test['Deal_Status'],pred)
    Confusion_matrix = confusion_matrix(test['Deal_Status'],pred)
    print('==='*20)
    print('Accuracy = '+str(Accuracy))
    print('==='*20) 
    print(Confusion_matrix)
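A single 80/20 split of a small dataset gives a noisy accuracy estimate. One option, sketched below and not part of the original kernel, is to wrap the vectorizer and classifier in a scikit-learn `Pipeline` and score it with cross-validation. The eight toy pitches and their labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Invented examples with invented deal labels (1 = deal, 0 = no deal)
toy_texts = [
    "smart water bottle", "water bottle holder", "organic dog treats", "dog treats box",
    "smart phone case", "phone case holder", "organic baby food", "baby food box",
]
toy_labels = [1, 1, 0, 0, 1, 1, 0, 0]

# The vectorizer is refit inside every fold, so validation text never leaks into the vocabulary
model = make_pipeline(
    CountVectorizer(ngram_range=(2, 2)),
    LogisticRegression(random_state=10, C=5, max_iter=200),
)
scores = cross_val_score(model, toy_texts, toy_labels, cv=2, scoring="accuracy")
print(scores, scores.mean())
```

On the real data, `toy_texts` and `toy_labels` would be replaced by the cleaned description column and `Deal_Status`, with a larger `cv` value.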

## Step 5 - Run the machine learning algorithms in individual blocks

In [None]:
ML_Pipeline('LR')

In [None]:
ML_Pipeline('DTC')

In [None]:
ML_Pipeline('RF')

In [None]:
ML_Pipeline('GBC')

In [None]:
ML_Pipeline('NB')

In [None]:
ML_Pipeline('SVM')

In [None]:
ML_Pipeline('SGD')

# Concluding remarks

With the current setup, some combinations of n-grams and parameters reach ~60% accuracy.

I believe this can be optimized further to reach higher accuracy.

I invite others to work with the data and try to achieve a higher accuracy.

This kernel uses only the 'Deal_Status' column. Someone could also try the 'Deal_Shark' column to find a model that gives a higher accuracy for a particular Shark/Sharkette or a combination of them.
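An accuracy figure is easier to interpret next to the accuracy of a trivial model that always predicts the most common class. The sketch below computes that baseline for a made-up label list; substituting `test['Deal_Status']` would give the figure for the real split.

```python
from collections import Counter

# Hypothetical test labels (1 = deal, 0 = no deal), not the real distribution
toy_labels = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]

# Accuracy of always predicting the most frequent class
majority_class, majority_count = Counter(toy_labels).most_common(1)[0]
baseline_accuracy = majority_count / len(toy_labels)
print(majority_class, baseline_accuracy)  # 1 0.6
```

A model is only adding value to the extent that it beats this number.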

## Step 6 - Parameter Optimization & Tuning Using Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import fbeta_score, make_scorer
# F2 score: like F1, but weights recall higher than precision
ftwo_scorer = make_scorer(fbeta_score, beta=2)
parameters = {'kernel':('linear', 'rbf'), 'C':[0.01, 0.1, 1, 10, 100]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters, scoring=ftwo_scorer)
clf.fit(train_features,train['Deal_Status'])
print(clf.best_params_)
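The grid search above ranks parameter combinations by the F2 score rather than by accuracy. As a worked illustration of what `beta=2` means, the sketch below evaluates the standard F-beta formula by hand for a hypothetical classifier; the precision and recall values are made up.

```python
def fbeta(precision, recall, beta):
    """F-beta score from precision and recall (standard definition)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical classifier: finds every deal (recall 1.0) but half of its positive predictions are wrong
precision, recall = 0.5, 1.0
f1 = fbeta(precision, recall, beta=1)
f2 = fbeta(precision, recall, beta=2)
print(round(f1, 3))  # 0.667
print(round(f2, 3))  # 0.833
```

Because F2 rewards recall more than F1 does, the same classifier scores higher under F2 when its recall exceeds its precision.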