### Case Study 5: Data Science Shark Tank: Pitch Your Ideas

**Due Date: March 8, 2010, BEFORE the beginning of class at 11:00am**

NOTE: There are always last-minute issues submitting the case studies. DO NOT WAIT UNTIL THE LAST MINUTE!

<img src="https://techcrunch.com/wp-content/uploads/2018/10/shark-tank.jpg?w=730&crop=1" width="400px"> 

### Problem 1: The Business Part
As a group, learn about data science related businesses and research the current markets, such as search, social media, advertising, and recommendation.
Pick one of the markets for further consideration, and design a new service that you believe to be important in that market.
Define precisely in the report, and briefly in the cells below, the business problem that your team wants to solve.
Why is the problem important to solve?
Why do you believe you could make a big difference with data science technology?
How are you planning to persuade the investors to buy into your idea?

**Please describe here *briefly*  (please edit this cell)**

1) Your business problem to solve:

**There are a huge number of movies, but a given streaming service only has the funds to purchase the rights to a limited number. How should these services choose what types of movies to purchase or produce?**


2) Why is the problem important to solve?

**This problem is important to solve because a good catalogue of movies can be the difference between a profitable streaming service and a failing one. Public sentiment is very difficult to predict intuitively. If a streaming service pays to purchase or produce a movie that the public does not respond well to, it loses money on that deal, and if it consistently brings poorly received movies to the platform, people will eventually lose interest in the platform as a whole.**

3) What is your idea to solve the problem? 

**Our idea is to train a machine learning model to detect the sentiment of text, use it to estimate the percentage of people who respond positively to a selection of movies, and compare that output to the movies’ IMDb and Rotten Tomatoes scores to determine which rating system is the better measure of public opinion. We will then run a regression with predictors including movie genre, runtime, language, and age rating, and with the chosen rating system as the response, to figure out which types of movies people tend to respond well to. The coefficients of the fitted regression model show how each aspect of a movie relates to its public reception, which is valuable information for an up-and-coming streaming service.**

4) What differences could you make with your data science approach?

**Using data science, we will be able to predict much more accurately the types of movies that the public will like, improving the profitability and sustainability of our service relative to our competitors. Our algorithm will help our company allocate its money in the most profitable way and minimize the number of flops.**

5) Why do you believe the idea deserves the investment of the "sharks"?

**Our idea deserves the investment of the ‘sharks’ because using data science as the basis of a streaming platform is an innovative way of maximizing consumer engagement. There is a huge amount of money in the streaming services market, and becoming an important part of that market would bring substantial revenue to our company. With their money we could invest in using our sentiment analysis algorithm to produce our own set of movie ratings rather than depending on IMDb or Rotten Tomatoes, or we could look into different types of regressions to come up with a more accurate algorithm.**


### Problem 2: The Data Part 

Define how Twitter data and at least one other dataset can be combined to make a Data Science product.

In [2]:
# Importing useful libraries and classes
import sys
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LinearRegression

import timeit
import numpy as np
import pandas as pd
import twitter
import json
import matplotlib.pyplot as plt
from collections import Counter
from urllib.parse import unquote

In [3]:
# Twitter Authorization
# Credentials are read from environment variables (the variable names are our choice)
# so that secrets are never stored in the notebook itself
import os

CONSUMER_KEY = os.environ["TWITTER_CONSUMER_KEY"]
CONSUMER_SECRET = os.environ["TWITTER_CONSUMER_SECRET"]
OAUTH_TOKEN = os.environ["TWITTER_OAUTH_TOKEN"]
OAUTH_TOKEN_SECRET = os.environ["TWITTER_OAUTH_TOKEN_SECRET"]

auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)

twitter_api = twitter.Twitter(auth=auth)

In [4]:
# Loading the labelled movie reviews from the 'txt_sentoken' folder (one subfolder per sentiment class)
dataset = load_files('txt_sentoken', shuffle=False)

# Split the dataset in training and test set:
docs_train, docs_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.25, random_state=None)

# Fitting TF-IDF on the training docs only, then turning both training and testing docs into TF-IDF feature vectors
vectorized = TfidfVectorizer(ngram_range = (1, 1)).fit(docs_train)
Xtrain = vectorized.transform(docs_train)
Xtest = vectorized.transform(docs_test)
y_predictor = MLPClassifier(activation = 'logistic', hidden_layer_sizes = (100,), solver = 'adam').fit(Xtrain, y_train)
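For readers unfamiliar with the TF-IDF step above, here is a minimal, dependency-free sketch of the weighting it applies. It assumes scikit-learn's default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. The whitespace tokenizer and the three toy reviews are simplifications invented for illustration; `TfidfVectorizer` uses its own tokenization rules.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    # Simplified tokenizer (assumption): lowercase and split on whitespace
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {term: freq * idf[term] for term, freq in tf.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors

# Toy reviews, invented for illustration only
reviews = ["the great film", "the boring film", "the great great cast"]
vectors = tfidf_vectors(reviews)
for review, vector in zip(reviews, vectors):
    print(review, {term: round(w, 3) for term, w in vector.items()})
```

A word that appears in every review ("the") receives a lower weight than a word that appears in only some of them, which is why TF-IDF features tend to emphasize the words that distinguish one review from another.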

In [5]:
# Testing the sentiment of tweets relating to a series of movies, storing the results in a dictionary of the form {title: percentage of positive tweets}
Movies = ['Taxi Driver', 'The Social Network', 'The Matrix Reloaded', 'Hold the Dark', 'Uncorked', 'Stuart Little']
Moviesdict = {}

for q in Movies:
    tweets = []
    count = 100
    search_results = twitter_api.search.tweets(q=q, count=count)
    while True:
        # Keep original English tweets only (no retweets), including those on the first page
        for status in search_results['statuses']:
            if 'retweeted_status' not in status and status['lang'] == 'en':
                tweets.append(str(status['text']))
        if len(tweets) >= 300:
            break
        try:
            next_results = search_results['search_metadata']['next_results']
        except KeyError:  # No more results when next_results doesn't exist
            break
        kwargs = dict(kv.split('=') for kv in unquote(next_results[1:]).split("&"))
        search_results = twitter_api.search.tweets(**kwargs)

    if not tweets:  # Avoid dividing by zero when no usable tweets were found
        continue
    # Reuse the vectorizer already fitted on the training reviews
    X_tweets = vectorized.transform(tweets)
    y_predicted = y_predictor.predict(X_tweets)
    Moviesdict[q] = round(np.count_nonzero(y_predicted == 1) / len(tweets) * 100, 3)
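The paging step above turns the `next_results` query string into keyword arguments by hand. A slightly more robust alternative, sketched here with only the standard library, is `urllib.parse.parse_qsl`, which splits on `&` and `=` before percent-decoding, so a title containing `&` survives intact. The helper name `next_page_kwargs` and the example query strings are hypothetical.

```python
from urllib.parse import parse_qsl

def next_page_kwargs(next_results):
    """Turn a `next_results` query string into keyword arguments for the next search call."""
    # Drop the leading '?'; parse_qsl splits the pairs first and decodes each value afterwards
    return dict(parse_qsl(next_results.lstrip("?")))

# Hypothetical query strings in the shape returned under search_metadata
example = "?max_id=1234567890&q=Hold%20the%20Dark&count=100&include_entities=1"
kwargs = next_page_kwargs(example)
print(kwargs)

tricky = "?max_id=42&q=Fast%20%26%20Furious"
print(next_page_kwargs(tricky))
```

The resulting dictionary can be passed to the search call in the same way as before, for example `twitter_api.search.tweets(**kwargs)`.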


In [9]:
# Reading from the external data source (movie data)
df = pd.read_csv('movies.csv')

# Printing out the calculated public sentiment, IMDb rating, and Rotten Tomatoes rating of each movie
for i in Moviesdict.keys():
    for j in range(len(df['Title'])):
        if (df['Title'][j] == i):
            print(i, "\nTrue Positive Percentage:", Moviesdict[i], "\nIMDb Rating:", df["IMDb"][j], "\nRotten Tomatoes Rating:", df["Rotten Tomatoes"][j], "\n")
    
# IMDb tracks the tweet sentiment more closely when the IMDb and Rotten Tomatoes ratings disagree,
# while Rotten Tomatoes tracks it more closely when the two ratings are very similar. The latter
# case matters much less, so we treat IMDb as the more accurate rating overall.

Taxi Driver 
True Positive Percentage: 82.555 
IMDb Rating: 8.3 
Rotten Tomatoes Rating: 95% 

The Social Network 
True Positive Percentage: 68.065 
IMDb Rating: 7.7 
Rotten Tomatoes Rating: 96% 

The Matrix Reloaded 
True Positive Percentage: 76.301 
IMDb Rating: 7.2 
Rotten Tomatoes Rating: 73% 

Hold the Dark 
True Positive Percentage: 33.537 
IMDb Rating: 5.6 
Rotten Tomatoes Rating: 73% 

Uncorked 
True Positive Percentage: 49.302 
IMDb Rating: 6.1 
Rotten Tomatoes Rating: 93% 

Stuart Little 
True Positive Percentage: 70.149 
IMDb Rating: 5.9 
Rotten Tomatoes Rating: 67% 
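One way to make the comparison above less dependent on eyeballing is to compute the Pearson correlation between the tweet sentiment and each rating. The sketch below does this in plain Python using only the six sets of values printed above. With so few movies the result is a rough indication at best, and the helper name `pearson` is ours.

```python
import math

# Values copied from the printed output above:
# (positive tweet percentage, IMDb rating, Rotten Tomatoes percentage)
results = {
    'Taxi Driver':         (82.555, 8.3, 95),
    'The Social Network':  (68.065, 7.7, 96),
    'The Matrix Reloaded': (76.301, 7.2, 73),
    'Hold the Dark':       (33.537, 5.6, 73),
    'Uncorked':            (49.302, 6.1, 93),
    'Stuart Little':       (70.149, 5.9, 67),
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

sentiment = [row[0] for row in results.values()]
imdb = [row[1] for row in results.values()]
rotten = [row[2] for row in results.values()]

r_imdb = pearson(sentiment, imdb)
r_rotten = pearson(sentiment, rotten)
print("Sentiment vs IMDb:", round(r_imdb, 2))
print("Sentiment vs Rotten Tomatoes:", round(r_rotten, 2))
```

On these six movies the correlation is roughly 0.77 for IMDb and 0.17 for Rotten Tomatoes, which is consistent with choosing IMDb as the response variable, although a larger sample of movies would be needed to draw a firm conclusion.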



In [7]:
# Dropping every movie without an IMDb rating (about 500 out of over 16000)
df = df[df['IMDb'].notna()]

# Establishing predictor and response variables
y = df['IMDb']
predictors = ['13+', '18+', 'all', 'Hours', 'Last 5 years?', 'Action', 'Adventure', 'Sci-Fi', 'Thriller', 'Comedy', 'Western', 'Animation', 'Family', 'War', 'Drama', 'Documentary', 'Biography', 'Mystery', 'Crime', 'Horror', 'Fantasy', 'History', 'Romance', 'Sport', 'Musical']
x = df[predictors]
# Labelling a movie a success (1) when its IMDb rating is at least 6.5, and a failure (0) otherwise
yClassification = (y >= 6.5).astype(int)

# Fitting a LinearRegression model to predict the IMDb score and an MLPClassifier to predict
# whether a movie with given properties will be a success or a failure
y_model = LinearRegression().fit(x,y)
y_model_class = MLPClassifier(activation = 'logistic', hidden_layer_sizes = (100,), solver = 'adam').fit(x, yClassification)

# Printing the coefficients of the linear model to get an idea of what makes for a good or 
# a bad movie in the eyes of the public
for name, coef in zip(predictors, y_model.coef_):
    print(name, ":", round(coef, 3))

13+ : 0.211
18+ : -0.074
all : 0.087
Hours : 0.289
Last 5 years? : -0.062
Action : -0.264
Adventure : -0.129
Sci-Fi : -0.412
Thriller : -0.172
Comedy : 0.146
Western : 0.161
Animation : 0.722
Family : -0.085
War : 0.049
Drama : 0.38
Documentary : 1.215
Biography : 0.282
Mystery : 0.181
Crime : 0.039
Horror : -0.867
Fantasy : -0.015
History : 0.21
Romance : 0.005
Sport : 0.075
Musical : 0.119
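To read the coefficients above, recall that a linear regression predicts the intercept plus the sum of each coefficient multiplied by its feature value. The sketch below uses the rounded coefficients printed above to rank the predictors and to break down the feature contribution for one movie. The example movie is hypothetical, and the intercept (available as `y_model.intercept_`) is not printed in the notebook, so only the contribution on top of it is computed.

```python
# Coefficients copied from the printed output above (rounded to 3 decimals)
coefficients = {
    '13+': 0.211, '18+': -0.074, 'all': 0.087, 'Hours': 0.289,
    'Last 5 years?': -0.062, 'Action': -0.264, 'Adventure': -0.129,
    'Sci-Fi': -0.412, 'Thriller': -0.172, 'Comedy': 0.146, 'Western': 0.161,
    'Animation': 0.722, 'Family': -0.085, 'War': 0.049, 'Drama': 0.38,
    'Documentary': 1.215, 'Biography': 0.282, 'Mystery': 0.181,
    'Crime': 0.039, 'Horror': -0.867, 'Fantasy': -0.015, 'History': 0.21,
    'Romance': 0.005, 'Sport': 0.075, 'Musical': 0.119,
}

def feature_contributions(features):
    """Return {predictor: coefficient * value} for every non-zero feature."""
    return {name: coefficients[name] * value
            for name, value in features.items() if value != 0}

# Predictors ranked from most positive to most negative coefficient
ranked = sorted(coefficients.items(), key=lambda item: item[1], reverse=True)
print("Most positive:", ranked[0])
print("Most negative:", ranked[-1])

# Hypothetical example: a two-hour animated family comedy rated 'all'
example = {'all': 1, 'Hours': 2.0, 'Comedy': 1, 'Animation': 1, 'Family': 1}
contributions = feature_contributions(example)
total = sum(contributions.values())
print(contributions)
print("Contribution on top of the intercept:", round(total, 3))
```

Because the genre columns are 0/1 indicators, each genre coefficient can be read as the estimated change in IMDb score associated with a movie belonging to that genre, holding the other predictors fixed, while the `Hours` coefficient is the estimated change per additional hour of runtime.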


### Problem 3: The Demo Part  


Implement a small demo, prototype, or set of experimental result figures for the "product" of your data science company. You could use this demo during the pitch.

In [8]:
# Sample movies one may consider purchasing the rights to
testMovies = {}
testMovies['Citizen Kane'] = np.array([[0, 0, 1, 119.0/60.0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]])
testMovies['Back to the Future'] = np.array([[0, 0, 1, 116.0/60.0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0]])
testMovies['Grey Garden'] = np.array([[0, 0, 1, 100.0/60.0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
testMovies['Frozen'] = np.array([[0, 0, 1, 109.0/60.0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1]])
testMovies['Cloverfield'] = np.array([[1, 0, 0, 90.0/60.0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]])

# Sample Netflix original to give an idea of the type of movie current streaming services prioritize
testMovies['Rim of the World'] = np.array([[0, 0, 0, 98.0/60.0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
# The type of movie we would suggest producing based on our algorithm
testMovies['Your New Movie!'] = np.array([[1, 0, 0, 180.0/60, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1]])

# Printing the predicted scores of the test movies
for i in testMovies.keys():
    print("Movie:", i)
    print("Predicted IMDb Score:", y_model.predict(testMovies[i])[0])
    print("Success?", y_model_class.predict(testMovies[i])[0], "\n")

Movie: Citizen Kane
Predicted IMDb Score: 6.4679370000924505
Success? 0 

Movie: Back to the Future
Predicted IMDb Score: 5.403490773288368
Success? 0 

Movie: Grey Garden
Predicted IMDb Score: 7.556566492455293
Success? 1 

Movie: Frozen
Predicted IMDb Score: 6.622892748108609
Success? 1 

Movie: Cloverfield
Predicted IMDb Score: 4.176856489372991
Success? 0 

Movie: Rim of the World
Predicted IMDb Score: 5.262442741864594
Success? 0 

Movie: Your New Movie!
Predicted IMDb Score: 8.630241071433458
Success? 1 
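The hand-typed 25-element arrays in the demo are easy to get wrong by one position. As a possible alternative, the sketch below builds the same feature row from named properties, assuming the predictor order used when the model was fitted. The helper `make_features` and its parameter names are hypothetical and not part of the notebook.

```python
# Predictor order copied from the regression cell above
PREDICTORS = ['13+', '18+', 'all', 'Hours', 'Last 5 years?', 'Action', 'Adventure',
              'Sci-Fi', 'Thriller', 'Comedy', 'Western', 'Animation', 'Family', 'War',
              'Drama', 'Documentary', 'Biography', 'Mystery', 'Crime', 'Horror',
              'Fantasy', 'History', 'Romance', 'Sport', 'Musical']
AGE_COLUMNS = ('13+', '18+', 'all')
GENRE_COLUMNS = PREDICTORS[5:]

def make_features(runtime_minutes, age_rating=None, recent=False, genres=()):
    """Build one feature row (a list of 25 numbers) in the model's predictor order."""
    row = dict.fromkeys(PREDICTORS, 0.0)
    if age_rating is not None:
        if age_rating not in AGE_COLUMNS:
            raise ValueError(f"unknown age rating: {age_rating!r}")
        row[age_rating] = 1.0
    row['Hours'] = runtime_minutes / 60.0
    row['Last 5 years?'] = 1.0 if recent else 0.0
    for genre in genres:
        if genre not in GENRE_COLUMNS:
            raise ValueError(f"unknown genre: {genre!r}")
        row[genre] = 1.0
    return [row[name] for name in PREDICTORS]

# Reproduces the 'Citizen Kane' and 'Cloverfield' rows from the demo cell
citizen_kane = make_features(119, age_rating='all', genres=['Drama', 'Mystery'])
cloverfield = make_features(90, age_rating='13+',
                            genres=['Action', 'Sci-Fi', 'Thriller', 'Horror'])
print(citizen_kane)
print(cloverfield)
```

Since `predict` expects a two-dimensional input, a single row would be wrapped before use, for example `y_model.predict(np.array([citizen_kane]))`.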

