**Netflix Dataset Solutions**

The objective of this notebook is to follow a step-by-step workflow, explaining each step and the rationale for every decision we take during solution development. The notebook covers data acquisition, data exploration, data preprocessing, model training and evaluation, and presentation of the results, with the goal of predicting as accurately as possible whether a title in our chosen dataset is a movie or a TV show.


**Workflow stages**

The competition solution workflow goes through seven stages described in the Data Science Solutions book.
•	Question or problem definition.
•	Acquire training and testing data.
•	Wrangle, prepare, and cleanse the data.
•	Analyze, identify patterns, and explore the data.
•	Model, predict and solve the problem.
•	Visualize, report, and present the problem solving steps and final solution.
•	Supply or submit the results.

The workflow indicates the general sequence in which each stage may follow the other. However, there are use cases with exceptions.
•	We may combine multiple workflow stages. We may analyze by visualizing data.
•	We may perform a stage earlier than indicated. We may analyze data before and after wrangling.
•	We may perform a stage multiple times in our workflow. The visualize stage may be used multiple times.
•	We may drop a stage altogether. We may not need the supply stage to productize or service-enable our dataset for a competition.


**Question and problem definition**
*TV Shows and Movies listed on Netflix*

This dataset consists of TV shows and movies available on Netflix as of 2019. The dataset is collected from Flixable, which is a third-party Netflix search engine.
In 2018, they released an interesting report which shows that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.
Integrating this dataset with external datasets such as IMDb ratings or Rotten Tomatoes can also provide many interesting findings.

**Inspiration**
Some of the interesting questions (tasks) which can be explored with this dataset:
•	Understanding what content is available in different countries.
•	Identifying similar content by matching text-based features.
•	Network analysis of actors and directors to find interesting insights.
•	Has Netflix been increasingly focusing on TV rather than movies in recent years?


In [None]:
#libraries Imported
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd
from datetime import datetime
import re
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# machine learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier


In [None]:
#acquiring data
netflix_ds = pd.read_csv('../input/netflix-shows/netflix_titles.csv')
# keep an independent copy of the raw data (plain assignment would only create an alias)
netflix_ds1 = netflix_ds.copy()


In [None]:
#analyze data columns
print(netflix_ds.columns.values)


**Which features are categorical?**
These values classify the samples into sets of similar samples. Within categorical features, are the values nominal, ordinal, ratio, or interval based? Among other things, this helps us select the appropriate plots for visualization.
•	Categorical: Title, Type, Show Id, Country, Description, Director, Cast, and Listed in.

**Which features are numerical?**
These values change from sample to sample. Within numerical features, are the values discrete, continuous, or time series based? Among other things, this helps us select the appropriate plots for visualization.
•	Discrete: Release Year.
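
The split between categorical and numerical features can also be checked programmatically. The following is a minimal sketch on a small toy frame (the values are made up for illustration and are not taken from the Netflix file); the same two calls could be applied to `netflix_ds`.

```python
import pandas as pd

# Toy frame standing in for the Netflix table (illustrative values only)
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "TV Show", "Movie"],
    "release_year": [2016, 2019, 2011],
})

# Numeric columns versus everything else (string-like columns)
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Categorical:", categorical_cols)
print("Numerical:", numerical_cols)
```

Selecting by dtype only tells us how pandas stores a column; whether a numeric column is discrete or continuous still needs a look at the values.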


In [None]:
netflix_ds.info()

In [None]:
# preview the data from the start
netflix_ds.head()


In [None]:
# preview the data from the end
netflix_ds.tail()


**Which features are mixed data types?**
A feature may mix numerical and alphanumeric data. Such features are candidates for the correcting goal.
•	Rating and Duration are a mix of numeric and alphanumeric data. Show Id is alphanumeric.

**Which features may contain errors or typos?**
This is harder to review for a large dataset; however, reviewing a few samples may tell us outright which features require correcting.
•	Director, Cast, Release Year, Date added, and Country features may contain errors or typos, as there are several ways to describe a name, including titles, round brackets, and quotes used for alternative or short names.

**Which features contain blank, null or empty values?**
These will require correcting.
•	Director, Cast, Release Year, and Date added are noted here as containing null values, in that order; the non-null counts printed by `netflix_ds.info()` give the exact numbers.
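
Null counts can be read off directly instead of being inferred from `info()`. A minimal sketch on a toy frame (the column names mirror the dataset, the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries (illustrative values only)
toy = pd.DataFrame({
    "director": ["A", np.nan, np.nan],
    "cast": ["X", "Y", np.nan],
    "release_year": [2016, 2019, 2011],
})

# Number and share of missing values per column, most incomplete first
null_counts = toy.isnull().sum().sort_values(ascending=False)
null_share = (null_counts / len(toy)).round(2)
print(pd.DataFrame({"nulls": null_counts, "share": null_share}))
```

Sorting by the count makes the most incomplete features, which are the first candidates for completing or dropping, appear at the top.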

**What are the data types for various features?**
•	One feature is integer.
•	Eleven features are strings (object).


In [None]:
#datatype of columns
netflix_ds.info()
print('_'*40)


**What is the distribution of numerical feature values across the samples?**
This helps us determine, among other early insights, how representative the dataset is of the actual problem domain.
•	Total samples are 7787.
•	Release year 2013 is noted as the most frequent value, with 1012 titles.


In [None]:
netflix_ds.describe()

**What is the distribution of categorical features?**
•	Show id values are unique across the dataset (count=unique=7787).
•	The Type variable has two possible values, with about 69% movies (top=Movie, freq=5377/count=7787).
•	Country values have several duplicates across samples, since many titles share the same country.
•	Rating takes 14 possible values. TV-MA is the most common.
•	The Date added feature has a high ratio of duplicate values (unique=1565).
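
The share of each category can be computed with `value_counts`. This is a small sketch on an invented series of ten labels, not the real column, so the numbers only illustrate the mechanics:

```python
import pandas as pd

# Toy 'type' column: 7 movies and 3 TV shows (illustrative values only)
toy_type = pd.Series(["Movie"] * 7 + ["TV Show"] * 3, name="type")

type_counts = toy_type.value_counts()               # absolute frequency per label
type_share = toy_type.value_counts(normalize=True)  # relative frequency per label

print(type_counts)
print(type_share.round(2))
```

The same two calls on `netflix_ds['type']` would reproduce the `top` and `freq` entries that `describe(include=['O'])` reports.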


In [None]:
netflix_ds.describe(include=['O'])

**Assumptions based on data analysis**

We arrive at the following assumptions based on the data analysis done so far. We may validate these assumptions further before taking appropriate actions.

**Correlating:**

We want to know how well each feature correlates with type (movie or TV show). We want to do this early in our project and match these quick correlations with modeled correlations later in the project.

**Completing:**

1.	We may want to complete the date added feature, as it appears to be correlated with type.
2.	We may want to complete the country feature, as it may also correlate with type or another important feature.


**Correcting:**

1.	The country feature may be dropped from our analysis, as it contains a high ratio of duplicates (22%) and there may not be a correlation with type.
2.	The country feature may be dropped, as it is incomplete and contains many null values.
3.	Show id, title, description and listed in may be dropped from the dataset after extracting important information from them, as they do not contribute to type.
4.	The title and description features are relatively non-standard and may not contribute directly to type, so they may be dropped.

**Creating:**

1.	We may want to create a new feature called genre, extracted from Listed in.
2.	We may want to engineer the rating feature to encode its labels as numbers in a new feature.
3.	We may want to create a new feature for continent. This groups the many distinct country values into a small number of categories.

**Classifying:**

We may also add to our assumptions based on the problem description noted earlier.
1.	Movies (type=Movie) are more likely to have been made in the United States.
2.	Titles rated TV-MA are more likely to be movies.
3.	Movie duration is most likely to be around 90 min.
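
Assumptions of this kind can be checked with a normalized cross-tabulation. The sketch below uses five invented rows, so the shares say nothing about the real data; it only shows how the check could be written.

```python
import pandas as pd

# Toy rating/type pairs (illustrative values only)
toy = pd.DataFrame({
    "rating": ["TV-MA", "TV-MA", "TV-MA", "TV-14", "TV-14"],
    "type": ["Movie", "Movie", "TV Show", "TV Show", "Movie"],
})

# Share of each type within every rating (each row sums to 1)
share = pd.crosstab(toy["rating"], toy["type"], normalize="index")
print(share.round(2))
```

A row whose `Movie` share is well above the overall movie share would support the assumption for that rating.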


In [None]:
d2=netflix_ds[(netflix_ds['type']=='Movie')]
d2.describe(include=['O'])


**Converting a categorical feature**

Now we can convert features which contain strings to numerical values. This is required by most model algorithms. Doing so will also help us in achieving the feature completing goal. Let us start by converting the type feature in place to a numeric code, where TV Show=1 and Movie=0.
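
One caveat when mapping labels with a dictionary is that any label missing from the dictionary silently becomes NaN. A minimal sketch, assuming a toy series with one deliberately misspelled label:

```python
import pandas as pd

# Toy labels; the lowercase 'movie' is intentionally not in the mapping
labels = pd.Series(["TV Show", "Movie", "movie"])
type_mapping = {"TV Show": 1, "Movie": 0}

mapped = labels.map(type_mapping)

# Labels that the mapping did not cover
unmapped = labels[mapped.isna()].unique().tolist()
print(mapped)
print("Unmapped labels:", unmapped)
```

Checking for unmapped labels before filling or casting helps catch typos in the mapping keys early.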



In [None]:

#conversion of a type column into numerical data
title_mapping = {"TV Show": 1, "Movie": 0}
netflix_ds['type'] = netflix_ds['type'].map(title_mapping)
netflix_ds.head()


In [None]:
#conversion of a Duration into numerical data
netflix_ds['duration'] = netflix_ds.duration.str.extract('([0-9]+)', expand=False)
netflix_ds['duration'] = pd.to_numeric(netflix_ds['duration'])
netflix_ds.head()


In [None]:
#conversion of a Show Id into numerical data
netflix_ds['show_id'] = netflix_ds.show_id.str.extract('([0-9]+)', expand=False)
netflix_ds['show_id'] = pd.to_numeric(netflix_ds['show_id'])
netflix_ds.head()


In [None]:
#conversion of a rating column into numerical data
netflix_ds['rating'] = netflix_ds['rating'].astype(str)
r = {'TV-MA':1, 'R': 2,  'PG-13':3, 'TV-14':4, 'TV-PG':5 ,'NR':6 ,'TV-G':7 ,'TV-Y':8 , 'TV-Y7':9, 'PG':10, 'G':11, 'NC-17': 12,  'TV-Y7-FV' :13, 'UR':14}
netflix_ds['rating'] = netflix_ds['rating'].map(r)
# missing or unmapped ratings are filled with 1 (TV-MA, the most common rating)
netflix_ds['rating'] = netflix_ds['rating'].fillna(1)
netflix_ds['rating'] = netflix_ds['rating'].astype(int)
netflix_ds.head()

In [None]:
#conversion of a date added into numerical data (year only)
netflix_ds['date_added'] = pd.to_datetime(netflix_ds['date_added'].str.strip(), errors="coerce")
netflix_ds['date_added'] = netflix_ds['date_added'].dt.year
# most frequent year, used to fill the missing values
most_frequent_year = netflix_ds['date_added'].value_counts().idxmax()
print(netflix_ds.date_added.describe())
netflix_ds['date_added'] = netflix_ds['date_added'].fillna(most_frequent_year)
netflix_ds['date_added'] = netflix_ds['date_added'].astype(int)
netflix_ds.head()


**Creating new feature extracting from existing**

We want to analyze whether the listed in feature can be engineered to extract a genre, and whether a continent can be derived from country, and then test the correlation between these new features and type before dropping the country and listed in features.

**Observations:**

When we cross-tabulate genre against type, we note the following observations.

* Titles whose Listed in value matches none of the extracted genres are assigned genre 0.
* Type varies slightly among genre bands.
* Certain genres, such as sports and children and family, are mostly movies, while others, such as drama, are mostly TV shows.

**Decision:**
We decide to retain the new genre feature and the continent feature derived from country for model training.
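
The extraction used in this notebook keeps only one genre per title. As an alternative view, every genre in a comma-separated Listed in value could be counted by splitting and exploding the column. This is a sketch on three invented rows, not the approach used for model training here:

```python
import pandas as pd

# Toy rows with comma-separated genres (illustrative values only)
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "listed_in": ["Dramas, International Movies", "TV Dramas, TV Comedies", "Comedies"],
})

# One row per (title, genre) pair
genres = toy.assign(genre=toy["listed_in"].str.split(", ")).explode("genre")

# Number of titles of each type per genre
genre_by_type = pd.crosstab(genres["genre"], genres["type"])
print(genre_by_type)
```

With this layout a title contributes to each of its genres, so the counts add up to more than the number of titles.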


In [None]:
#Extraction of genre added from Listed in column (first matching genre per title)
netflix_ds['genre']=netflix_ds.listed_in.str.extract(r'(Horror|Action & Adventure|Sci-Fi & Fantasy|Romantic|Comedies|Dramas|Sports|Thrillers|Classic|Cult|Children & Family|Science & Nature|Music)', expand=False)
g={"Horror": 1,"Action & Adventure": 2,"Sci-Fi & Fantasy": 3, "Romantic": 4, "Comedies": 5, "Dramas": 6, "Sports": 7, "Thrillers": 8, "Classic": 9, "Cult": 10, "Children & Family": 11, "Science & Nature": 12, "Music": 13}
#Extraction of genre added into numerical data; 0 marks titles with no matching genre
netflix_ds['genre'] = netflix_ds['genre'].map(g)
netflix_ds['genre'] = netflix_ds['genre'].fillna(0)
netflix_ds['genre'] = netflix_ds['genre'].astype(int)
print(pd.crosstab(netflix_ds['genre'], netflix_ds['type']))
netflix_ds[["genre", "type"]].groupby(['genre'], as_index=False).mean().sort_values(by='type', ascending=False)


In [None]:
#Group Country in Continent
#conversion of a country into numerical data
asia=['Russia', 'China', 'India', 'Kazakhstan','Saudi Arabia', 'Iran', 'Mongolia', 'Indonesia',  'Pakistan',  'Turkey',  'Myanmar',  'Afghanistan',  'Yemen',  'Thailand', 'Turkmenistan', 'Uzbekistan', 'Iraq', 'Japan', 'Vietnam','Malaysia' ,'Oman', 'Philippines','Laos', 'Kyrgyzstan', 'Nepal','Tajikistan','North Korea','South Korea', 'Jordan', 'Azerbaijan','Syria', 'Cambodia' ,'Bangladesh', 'United Arab Emirates','Georgia', 'Sri Lanka', 'Bhutan', 'Taiwan', 'Armenia', 'Israel', 'Kuwait', 'Timor-Leste', 'Qatar', 'Lebanon','Cyprus', 'Palestine','Brunei','Bahrain','Singapore', 'Maldives']
europe=['Germany','United Kingdom','France','Italy','Spain','Ukraine','Poland','Romania','Netherlands','Belgium','Czech Republic','Greece','Portugal','Sweden','Hungary','Belarus','Austria','Serbia','Switzerland','Bulgaria','Denmark','Finland','Slovakia','Norway','Ireland','Croatia','Moldova','Bosnia','Albania','Lithuania','North Macedonia','Slovenia','Latvia','Estonia','Montenegro','Luxembourg','Malta','Iceland','Andorra','Monaco','Liechtenstein','San Marino','Holy See']
africa=['Ethiopia', 'Nigeria','Egypt','DR Congo','Tanzania','South Africa','Kenya','Uganda','Algeria','Sudan','Morocco','Angola','Mozambique','Ghana','Madagascar','Cameroon',"Côte d'Ivoire",'Niger','Burkina Faso','Mali','Malawi','Zambia','Senegal','Chad','Somalia','Zimbabwe','Guinea','Rwanda','Benin','Burundi','Tunisia','South Sudan','Togo','Sierra Leone','Libya','Congo','Liberia','Central African Republic','Mauritania','Eritrea','Namibia','Gambia','Botswana','Gabon','Lesotho','Guinea-Bissau','Equatorial Guinea','Mauritius','Eswatini','Djibouti','Cabo Verde','Sao Tome','Seychelles']
australia=['Australia', 'Micronesia', 'Fiji', 'Kiribati', 'Marshall Islands', 'Nauru', 'New Zealand', 'Palau', 'Papua New Guinea', 'Samoa','Solomon Islands', 'Tonga', 'Tuvalu','Vanuatu']
america=['Anguilla','Barbuda','Argentina','Aruba','Bahamas','Barbados','Belize','Bermuda','Bolivia','Bonaire','Brazil','British Virgin Islands','Canada','Cayman Islands','Chile','Clipperton Island','Colombia','Costa Rica','Cuba','Curaçao','Dominica','Dominican Republic','Ecuador','El Salvador','Falkland Islands','French Guiana' ,'Greenland','Grenada','Guadeloupe','Guatemala','Guyana','Haiti','Honduras','Jamaica','Martinique','Mexico','Montserrat','Navassa Island','United States','Nicaragua','Panama','Paraguay','Peru','Puerto Rico','Saba','Saint Barthélemy','Saint Kitts','Saint Lucia','Saint Martin','Saint Pierre','Saint Vincent','Sint Eustatius','Sint Maarten', 'South Georgia','South Sandwich Islands','Suriname','Trinidad','Tobago','Turks','Caicos Islands','Virgin Islands','United States of America','Uruguay','Venezuela']
contin = {'Asia':1, 'Europe': 2,  'Africa':3, 'Australia':4, 'America':5}
groups = {'Asia': asia, 'Europe': europe, 'Africa': africa, 'Australia': australia, 'America': america}
# use the first listed country so that every title gets exactly one continent code
first_country = netflix_ds['country'].fillna('').str.split(',').str[0].str.strip()
# 0 = country missing or not found in any list
netflix_ds['continent'] = 0
# later groups overwrite earlier ones, so 'Papua New Guinea' ends up in Australia
# (not Africa via 'Guinea') and 'South Georgia' in America (not Asia via 'Georgia')
for name, countries in groups.items():
    pattern = '|'.join(re.escape(c) for c in countries)
    netflix_ds.loc[first_country.str.contains(pattern), 'continent'] = contin[name]
netflix_ds.head(10)

**Analyze by pivoting features**

To confirm some of our observations and assumptions, we can quickly analyze our feature correlations by pivoting features against each other. We can only do so at this stage for features which do not have any empty values. It also only makes sense to do so for features which are categorical (type), ordinal (duration) or discrete (rating, date added and release year).
•	Date Added: because TV Show=1 and Movie=0, the mean of type for each year is the share of TV shows added that year. This share varies across years, so we decide to include this feature in our model.
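
The reason a pivot on the mean of type is informative can be seen on a tiny example. With a 0/1 target, the group mean equals the fraction of rows in that group with target 1. A minimal sketch with six invented rows:

```python
import pandas as pd

# Toy data: type is already encoded as TV Show=1, Movie=0 (illustrative values only)
toy = pd.DataFrame({
    "date_added": [2018, 2018, 2019, 2019, 2019, 2019],
    "type": [1, 0, 1, 1, 1, 0],
})

# Mean of a 0/1 column per group = share of TV shows in that group
pivot = (
    toy.groupby("date_added", as_index=False)["type"]
    .mean()
    .sort_values(by="type", ascending=False)
)
print(pivot)
```

Here 2019 has three TV shows out of four titles (0.75) and 2018 has one out of two (0.5), so 2019 is listed first.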


In [None]:
#Pivot date _added
netflix_ds[['date_added', 'type']].groupby(['date_added'], as_index=False).mean().sort_values(by='type', ascending=False)

In [None]:
#Pivot Release year
netflix_ds[['release_year', 'type']].groupby(['release_year'], as_index=False).mean().sort_values(by='type', ascending=False)

In [None]:
#Pivot Continent
netflix_ds[['continent', 'type']].groupby(['continent'], as_index=False).mean().sort_values(by='type', ascending=False)

In [None]:
#pivot Rating
netflix_ds[['rating', 'type']].groupby(['rating'], as_index=False).mean().sort_values(by='type', ascending=False)

In [None]:
#Pivot Genre
netflix_ds[['genre', 'type']].groupby(['genre'], as_index=False).mean().sort_values(by='type', ascending=False)

In [None]:
#pivot Duration
netflix_ds[['duration', 'type']].groupby(['duration'], as_index=False).mean().sort_values(by='type', ascending=False)

**Analyze by visualizing data**

Let us start by understanding correlations between numerical features and our solution goal (type). A histogram chart is useful for analyzing numerical variables like date added, where extracting the year helps simplify the problem. The histogram indicates the distribution of samples using automatically defined bins of equal width. Note that the y-axis in histogram visualizations represents the count of samples.
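
How the bins are formed can be inspected without plotting. The sketch below applies `np.histogram` to ten invented years with the same `bins=5` setting used in the plots; the array is an assumption for illustration only.

```python
import numpy as np

# Toy years (illustrative values only)
years = np.array([2015, 2016, 2016, 2017, 2018, 2018, 2019, 2019, 2019, 2020])

# Five equal-width bins between the minimum and maximum value
counts, edges = np.histogram(years, bins=5)

print("bin edges:", edges)
print("counts per bin:", counts)
```

Each bin is closed on the left and open on the right, except the last one, which also includes the maximum value. This is why the final bin can look taller than the others.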

**Observations:**

•	Most movies were added in 2019.
•	From 2009 to 2012, only movies were added.
•	Most samples are in the 1925 to 2021 range.
•	The range can be hard to read, as the dates are represented by integer years.

**Decisions:**

This simple analysis confirms our assumptions as decisions for subsequent workflow stages.
•	We should consider date added in our model training.
•	Complete the date added feature for null values by replacing them with the most frequent year.
•	We should extract the year from date added to simplify the feature.


In [None]:
#Correlating numerical features: type (Tv show or movie) and date added on Netflix
h2 = sns.FacetGrid(netflix_ds, col='type')
h2.map(plt.hist, 'date_added', bins=5)


**Correlating numerical and ordinal features**

We can combine multiple features for identifying correlations using a single plot. This can be done with numerical and categorical features which have numeric values.

**Observations:**

•	The description does not help our analysis, so we drop it.
•	The country value can be quite long, so we have grouped it into a new column, continent.
•	The genre is extracted from listed in, as it can tell us more about the type.
•	Between release years 1942 and 1984, most of the titles released were movies.

**Decisions:**

•	Consider release year for model training.
•	Consider duration for model training.


In [None]:
#Correlating numerical features: type (Tv show or movie) and release year on Netflix
h2 = sns.FacetGrid(netflix_ds, col='type')
h2.map(plt.hist, 'release_year', bins=5)
h2 = sns.FacetGrid(netflix_ds, col='type')
h2.map(plt.hist, 'duration', bins=5)


**Correlating categorical features:**

Now we can correlate categorical features with our solution goal.

**Observations:**

•	Most rating categories contain more movies than TV shows.
•	The genre can help us identify the type, either movie or TV show.

**Decisions:**

•	Add Rating feature to model training.
•	Add Genre feature to model training.


In [None]:
#Correlating numerical features: type (Tv show or movie) and Rating and genre on Netflix
h2 = sns.FacetGrid(netflix_ds, col='type')
h2.map(plt.hist, 'rating', bins=5)
h2 = sns.FacetGrid(netflix_ds, col='type')
h2.map(plt.hist, 'genre', bins=5)


**Wrangle data**

We have collected several assumptions and decisions regarding our dataset and solution requirements. Let us now execute the remaining decisions and assumptions for the correcting, creating, and completing goals.

**Correcting by dropping features**

This is a good starting goal to execute. By dropping features we are dealing with fewer data points, which speeds up our notebook and eases the analysis.
Based on our assumptions and decisions, we want to drop the description, title, listed in, country, director and cast features after extracting useful information. Show id is kept in the table but is not used as a model feature.


In [None]:
#Dropping Of Unwanted Columns
print("Before", netflix_ds.shape)
netflix_ds = netflix_ds.drop(['title','country','cast','director', 'listed_in', 'description'], axis=1)
"After", netflix_ds.shape


In [None]:
#After complete conversion into numerical data
netflix_ds.head()

**Model, predict and solve**

Now we are ready to train a model and predict the required solution. There are 60+ predictive modeling algorithms to choose from. We must understand the type of problem and solution requirement to narrow down to a select few models which we can evaluate. Our problem is a classification problem: we want to identify the relationship between the output (type, TV show or movie) and other variables or features (genre, continent, year...). We are also performing a category of machine learning called supervised learning, as we are training our model with a labeled dataset. With these two criteria, supervised learning plus classification, we can narrow down our choice of models to a few. These include:

•	Logistic Regression
•	KNN or k-Nearest Neighbors
•	Support Vector Machines
•	Naive Bayes classifier
•	Decision Tree
•	Random Forest
•	Perceptron
•	Artificial neural network
•	RVM or Relevance Vector Machine
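
The cells that follow report accuracy on the training split. As a complementary check, accuracy on the held-out test split gives a better idea of how a model generalizes. The sketch below illustrates the difference on synthetic data from `make_classification` (an assumption made so the example is self-contained; it is not the Netflix table).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with some label noise (not the Netflix data)
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A fully grown decision tree memorizes the training split
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = tree.score(X_test, y_test)     # accuracy on unseen data

print("train accuracy:", round(train_acc * 100, 2))
print("test accuracy:", round(test_acc * 100, 2))
```

In the notebook, the same comparison could be made by calling `score(X_test, y_test)` next to each `score(X_train, y_train)`.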


In [None]:
#divide data into train and test
#model Data:
feature_cols = ['continent', 'date_added','release_year', 'rating','duration','genre']
X = netflix_ds[feature_cols] # Features
y = netflix_ds.type # Target variable
# hold out 25% of the rows as a test split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
X_train.shape, y_train.shape, X_test.shape


In [None]:
#Logistic Regression Algorithm
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, y_train) * 100, 2)
print(acc_log)


In [None]:
#logistic regression coefficient for each feature (same order as feature_cols)
coeff_df = pd.DataFrame({'Feature': feature_cols})
coeff_df["Correlation"] = pd.Series(logreg.coef_[0])
coeff_df.sort_values(by='Correlation', ascending=True)


In [None]:
#svc algorithm:
svc = SVC()
svc.fit(X_train, y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, y_train) * 100, 2)
print(acc_svc)


In [None]:
#KNN algorithm:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, y_train) * 100, 2)
print(acc_knn)


In [None]:
#Naive Bayes algorithm:
gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, y_train) * 100, 2)
print(acc_gaussian)


In [None]:
#Perceptron algorithm:
perceptron = Perceptron()
perceptron.fit(X_train, y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, y_train) * 100, 2)
print(acc_perceptron)


In [None]:
#linear svc algorithm:
linear_svc = LinearSVC()
linear_svc.fit(X_train, y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, y_train) * 100, 2)
acc_linear_svc


In [None]:
#sgd algorithm:
sgd = SGDClassifier()
sgd.fit(X_train, y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, y_train) * 100, 2)
acc_sgd


In [None]:
#decision Tree algorithm:
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, y_train) * 100, 2)
print(acc_decision_tree)


In [None]:
#Random Forest algorithm:
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, y_train)
acc_random_forest = round(random_forest.score(X_train, y_train) * 100, 2)
print(acc_random_forest)


In [None]:
#model Evaluation (scores are accuracy on the training split, in percent)
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 
              'Stochastic Gradient Descent', 'Linear SVC', 
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)


In [None]:
#Model submission (Y_pred holds the test predictions of the last model run, the random forest)
submission = pd.DataFrame({
        "type": Y_pred
    })
submission.sort_values(by='type', ascending=False)
