### Amazon Prime TV Shows Dataset

**Variables:**
The data set contains the following columns:
1. **Name of the show**: the title of the show.
2. **Year of release**: the year in which the show was released or first went on air.
3. **No of seasons available**: the number of seasons of the show that are available on Prime.
4. **Language**: the audio language of the show; subtitle languages are not taken into consideration.
5. **Genre**: the genre of the show, such as Kids, Drama, or Action.
6. **IMDb rating**: the IMDb rating of the show; for many TV shows and kids' shows the rating was not available.
7. **Age of viewers**: the age of the target audience. "All" means that the content is not restricted to any particular age group and all audiences can view it.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#Import data
data= pd.read_csv('../input/amazon-prime-tv-shows/Prime TV Shows Data set.csv', encoding = 'iso-8859-1')
data.head()

### Data Preprocessing and Cleaning

In [None]:
data.drop('S.no.',axis=1,inplace=True)     ## Dropping Serial Number column

In [None]:
data.isnull().sum()      ## Checking Null values

In [None]:
type(data)      ## type of data

In [None]:
data.nunique()       ## number of unique values in every column

In [None]:
data['Language'].unique()         ## Audio languages of the shows available on Amazon Prime.

In [None]:
data['Genre'].unique()         ## Genres of the shows available on Amazon Prime.

In [None]:
data['Age of viewers'].unique()      ## Unique target age groups of the shows on Amazon Prime.

In [None]:
data['Name of the show'].nunique()    

In [None]:
data.shape

In [None]:
data.max()    ## Column-wise maximum: the latest release year is 2020, the most seasons available is 20, and the highest rating is 9.5

In [None]:
data.min()    ## Column-wise minimum: the earliest release year is 1926, the fewest seasons available is 1, and the lowest rating is 3.7

In [None]:
data.describe(include='all')        ## Describe all columns, both categorical (nominal) and quantitative.

In [None]:
data.describe()

In [None]:
data['Genre'].value_counts().head(2)    ## Drama is the most frequent value (the mode)

In [None]:
data['Genre'] = data['Genre'].fillna(data['Genre'].value_counts().index[0])    ## filling missing values with the most frequent value (assignment avoids chained inplace fillna)

In [None]:
data['Name of the show'] = data['Name of the show'].fillna(data['Name of the show'].value_counts().index[0])   ## filling missing values with the most frequent value

In [None]:
data['No of seasons available'] = data['No of seasons available'].fillna(data['No of seasons available'].value_counts().index[1])   ## filling missing values; note that index[1] is the second most frequent value, not the mode

In [None]:
data['Language'] = data['Language'].fillna(data['Language'].value_counts().index[0])     ## filling missing values with the most frequent value

In [None]:
data['Age of viewers'] = data['Age of viewers'].fillna(data['Age of viewers'].value_counts().index[0])    ## filling missing values with the most frequent value

In [None]:
data['Year of release'] = data['Year of release'].fillna(data['Year of release'].value_counts().index[0])   ## filling missing values with the most frequent value

In [None]:
data.isnull().sum()   

- IMDb rating has the highest number of missing values. Filling them with a single value such as the mean or median can bias the model and our interpretation: it distorts the distribution of the column, and the mean itself is sensitive to outliers.
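
A minimal sketch of why a constant fill is risky when many values are missing. The ratings here are synthetic (not the Prime data), and the 40% missing share is an assumption chosen for illustration; the point is only that mean imputation shrinks the spread of the column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic ratings (assumption: not the real data), with roughly 40% set to NaN.
ratings = pd.Series(rng.normal(loc=7.5, scale=1.0, size=500))
mask = rng.random(500) < 0.4
ratings_missing = ratings.mask(mask)

# Share of missing values, the same quantity isnull().sum() reports as a count.
missing_share = ratings_missing.isnull().mean()

# Filling with the mean keeps the mean but shrinks the standard deviation.
mean_filled = ratings_missing.fillna(ratings_missing.mean())
std_before = ratings_missing.std()
std_after = mean_filled.std()

print(f"missing share: {missing_share:.2f}")
print(f"std before fill: {std_before:.3f}, after mean fill: {std_after:.3f}")
```

On the real data the same check would use `data['IMDb rating']` in place of the synthetic series.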

In [None]:
data['IMDb rating'].value_counts().head(2)

### Data Visualization

In [None]:
import matplotlib.pyplot as plt

In [None]:
data.sample(2)

In [None]:
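data.select_dtypes(include='number').corr(method='pearson')   ## correlation of the numeric columns only

In [None]:
import seaborn as sns
plt.subplots(figsize=(12,6))
sns.set(font_scale=1.2)
sns.heatmap(data.select_dtypes(include='number').corr(), annot=True)
plt.show()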
plt.show()

#### The correlations between the variables are weak to moderate, both positive and negative, and close to negligible. We will check this further using the variance inflation factor (VIF).

In [None]:
from plotnine import *
import warnings
warnings.filterwarnings('ignore')

In [None]:
data.hist(bins=50,figsize=(20,15))
plt.show()

###### The distribution of year of release is concentrated in recent years: the number of releases on Amazon Prime increases over time, which suggests that the Prime catalogue has grown strongly.
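
The shape of such a distribution can also be summarized numerically. The sketch below uses a handful of hypothetical release years (not the Prime data) to show how skewness and per-year counts describe the same pattern as the histogram.

```python
import pandas as pd

# Hypothetical release years, bunched toward recent years with a few old titles.
years = pd.Series([1926, 1990, 2005, 2012, 2015, 2016, 2017, 2018, 2018,
                   2019, 2019, 2019, 2020, 2020, 2020])

# Negative skewness means a long tail toward older years (left-skewed).
skewness = years.skew()

# Titles per year show the growth over time directly.
titles_per_year = years.value_counts().sort_index()

print(f"skewness: {skewness:.2f}")
print(titles_per_year.tail(3))
```

On the notebook's data, `data['Year of release'].skew()` would give the corresponding figure.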

In [None]:
sns.boxplot(x ='Age of viewers',y = 'IMDb rating',data = data,palette ='rainbow')   ## Checking outliers in the rating for each age group

In [None]:
sns.boxplot(x ='IMDb rating',y = 'Age of viewers',data = data,palette ='rainbow')   ## Same plot with horizontal boxes

In [None]:
sns.countplot(x = "Age of viewers", data =data)

#### Most shows on Amazon Prime target viewers aged 16+, followed by shows for viewers aged 18+.

In [None]:
sns.countplot(y= "Language", data = data)

#### English is the most common language among the shows, and Hindi is the second most common.

In [None]:
plt.figure(figsize=(20,15))
sns.histplot(data['IMDb rating'], kde=True, stat='density')        ## density plot (histplot replaces the deprecated distplot)

In [None]:
sns.pairplot(data)

#### From the above plots we can say that the variables are not linearly related and are not normally distributed.
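
A visual impression of non-normality can be backed by a formal test. The following sketch applies the Shapiro-Wilk test from `scipy.stats` to two synthetic samples (both are assumptions for illustration, not the Prime data): one drawn from a normal distribution and one strongly skewed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic samples (assumption): one roughly normal, one strongly skewed.
normal_sample = rng.normal(loc=7.5, scale=0.8, size=300)
skewed_sample = rng.exponential(scale=1.0, size=300)

# Shapiro-Wilk: H0 is that the sample comes from a normal distribution.
stat_n, p_n = stats.shapiro(normal_sample)
stat_s, p_s = stats.shapiro(skewed_sample)

print(f"normal sample: W={stat_n:.3f}, p={p_n:.4f}")
print(f"skewed sample: W={stat_s:.3f}, p={p_s:.4g}")
```

For the notebook's column, the call would be `stats.shapiro(data['IMDb rating'].dropna())`, since the test does not accept missing values.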

In [None]:
from wordcloud import WordCloud
from wordcloud import STOPWORDS
stopwords = set(STOPWORDS)
#### Wordcloud showing the most frequent 'Genre released by Amazon prime'.
# sep=' ' keeps the genres as separate words instead of fusing them into one string
wordcloudG=WordCloud(max_font_size=40, relative_scaling=.5,background_color='White',stopwords=stopwords).generate(data['Genre'].str.cat(sep=' '))
plt.imshow(wordcloudG, interpolation="bilinear")
plt.axis('off')
plt.margins(x=0, y=0) 
plt.savefig("genre_wc.png")   # save before plt.show(); saving afterwards writes an empty figure
plt.show()

In [None]:
#### Wordcloud showing the most frequent words in 'Name of the show released by Amazon prime'.
# sep=' ' keeps the titles as separate words instead of fusing them into one string
wordcloud2=WordCloud(max_font_size=40, relative_scaling=.5,background_color='White',stopwords=stopwords).generate(data['Name of the show'].str.cat(sep=' '))
plt.imshow(wordcloud2, interpolation="bilinear")
plt.axis('off')
plt.margins(x=0, y=0) 
plt.savefig("show_name_wc.png")   # save before plt.show(); saving afterwards writes an empty figure
plt.show()

In [None]:
# Top 10 TV shows in the genre: 'Drama'
top_drama = data[data['Genre'] == 'Drama'].sort_values(by = 'IMDb rating',ascending = False)
#Top 10 TV shows in drama
top_drama.head(10)

In [None]:
# Let us now take a look at 20 worst rated shows
data.sort_values(by = "IMDb rating", ascending = True).head(20)

In [None]:
top_english = data[data['Language'] == 'English'].sort_values(by = 'IMDb rating',ascending = False)
#Top 10 TV shows in english with the highest rating by viewers
top_english.head(10)

### Outlier Detection and Imputation Using KNNImputer

In [None]:
Q1=data['IMDb rating'].quantile(0.25)
Q3=data['IMDb rating'].quantile(0.75)
IQR=Q3-Q1
print(Q1)
print(Q3)
print(round(IQR,3))
Lower_Whisker = Q1-1.5*IQR
Upper_Whisker = Q3+1.5*IQR
print(round(Lower_Whisker,3), round(Upper_Whisker,3))

In [None]:
data.head(1)

#### KNN Imputer for imputing missing values in IMDb rating
- Since a large number of values are missing in IMDb rating, filling this variable with the mean, median, or mode may bias the data. We therefore use the KNN imputer.
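
To make the idea concrete, here is a small sketch of `KNNImputer` on a toy array (the numbers are illustrative, not taken from the Prime data). Each missing entry is replaced by the average of that feature over the nearest rows, where distance is computed on the features both rows have in common.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy feature matrix with two missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# With n_neighbors=2 each gap is filled with the mean of its 2 nearest rows.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

The gap in the first row is filled with the mean of 3 and 5 (its two nearest rows), and the gap in the third row with the mean of 3 and 8.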

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data.drop('Year of release', axis=1), data['Year of release'],
                                                   test_size=0.2, random_state=5)

In [None]:
import math as m
m.sqrt(len(data))    # Rule of thumb: choose k close to the square root of the number of rows

K=20

In [None]:
num = [col for col in X_train.columns if X_train[col].dtypes != 'O']
X_train[num].head()

In [None]:
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors= 20,add_indicator=True)

imputer1 = imputer.fit_transform(X_train[num])

print("\nImpute with 20 Neighbour: \n", imputer1)

In [None]:
imputer.transform(X_train[num])

In [None]:
data2=pd.DataFrame(imputer.transform(X_train[num]), index=X_train.index)  # Converting train array into dataframe, keeping the original row index

In [None]:
X_test[num].isna().sum()


In [None]:
imputer.transform(X_test[num])

In [None]:
data1=pd.DataFrame(imputer.transform(X_test[num]), index=X_test.index)     # Converting test array into dataframe, keeping the original row index

In [None]:
pd.DataFrame(imputer.transform(X_test[num])).isna().sum().sum()

In [None]:
df = pd.concat([data2, data1])

In [None]:
df.head(2) 

In [None]:
df = df.rename({0:"No of seasons available",1:"IMDb ratingss"}, axis='columns')
df

In [None]:
data.shape

In [None]:
df.drop(columns=2, inplace=True)   ## drop column 2, the missing-value indicator added by add_indicator=True

In [None]:
df.head(2)

In [None]:
df.isnull().sum() 

In [None]:
Q1=df['IMDb ratingss'].quantile(0.25)
Q3=df['IMDb ratingss'].quantile(0.75)
IQR=Q3-Q1
print('First Quartile',Q1)
print('Third Quartile',Q3)
print('IQR-',round(IQR,3))
Lower_Whisker = Q1-1.5*IQR
Upper_Whisker = Q3+1.5*IQR
print('\n')
print('Lower Whisker value is -',round(Lower_Whisker,3))
print('Upper Whisker value is -', round(Upper_Whisker,3))

In [None]:
print(df['IMDb ratingss'].min())
print(df['IMDb ratingss'].max())

After imputing the missing values, outliers are still present in the rating. The minimum rating is 3.7 and the maximum rating is 9.5, while the lower whisker is 6.625 and the upper whisker is 8.345:
- Minimum value = 3.7 < lower whisker = 6.625, so there are lower outliers.
- Maximum value = 9.5 > upper whisker = 8.345, so there are upper outliers.
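
The whisker rule used above can be wrapped in a small helper. This is a sketch with a hypothetical function name (`iqr_bounds`) and made-up ratings, not the notebook's data; it applies the same Q1 - 1.5*IQR and Q3 + 1.5*IQR limits.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) whisker limits of the k*IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up ratings with one very low and one very high value.
ratings = pd.Series([3.7, 6.9, 7.2, 7.4, 7.5, 7.6, 7.8, 8.0, 8.1, 9.5])

lower, upper = iqr_bounds(ratings)

# Values outside the whiskers are flagged as outliers.
outliers = ratings[(ratings < lower) | (ratings > upper)]

print(round(lower, 3), round(upper, 3))
print(outliers.tolist())
```

Applied to the notebook, `iqr_bounds(df['IMDb ratingss'])` would reproduce the two whisker values printed above.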

Let's visualize this with a boxplot.

In [None]:
sns.boxplot(x=df['IMDb ratingss'])

- Since there are numerous outliers, we do not remove them: dropping them would discard a large amount of information.

### Analysis

In [None]:
df.head(2)

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
# the independent variables set
X = df[['No of seasons available','IMDb ratingss']]
  
# VIF dataframe
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
  
# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                          for i in range(len(X.columns))]
  
print(vif_data)

- VIF = 1: not correlated
- 1 < VIF < 5: moderately correlated
- VIF >= 5 (some references use 10): highly correlated

Here the VIF values lie between 1 and 5, so the features are at most moderately correlated and the risk of multicollinearity is low.
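
For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on the remaining ones. The sketch below computes it from this definition with NumPy on synthetic columns (the helper name `vif` and the data are assumptions for illustration). It includes an intercept in each regression; as far as I know, statsmodels' `variance_inflation_factor` regresses only on the columns it is given, so its values can differ when no constant column is supplied.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(2)
a = rng.normal(size=400)
b = rng.normal(size=400)            # independent of a
c = a + 0.1 * rng.normal(size=400)  # nearly a copy of a

vif_independent = vif(np.column_stack([a, b]))
vif_collinear = vif(np.column_stack([a, c]))

print("independent columns:", vif_independent.round(3))
print("collinear columns:  ", vif_collinear.round(1))
```

Independent columns give values close to 1, while the nearly duplicated column pushes both VIFs far above the usual thresholds.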

- PCA is commonly used to reduce severe multicollinearity. Let's check whether it gives good results on this data.
- PCA works best on a standardized feature set, so we first standardize the features with StandardScaler.

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(data2)
X_test= sc.transform(data1)

In [None]:
from sklearn.decomposition import PCA

pca = PCA()
X_train1 = pca.fit_transform(X_train)
X_test1 = pca.transform(X_test)

In [None]:
explained_variance = pca.explained_variance_ratio_

In [None]:
explained_variance

- The PCA class exposes explained_variance_ratio_, which gives the fraction of the total variance explained by each principal component.
- The explained_variance variable is therefore a float array that contains one variance ratio per principal component.
- The first principal component explains 43.33% of the variance and the second explains 36.16%. Together, the first two components capture 43.33 + 36.16 = 79.49% of the variance in the feature set. Keeping only these two would lose the remaining variance (about 20%), and multicollinearity is low here anyway, so we do not use PCA.
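
The cumulative view of these ratios is often easier to read than the individual values. Below is a minimal sketch on synthetic data (three features, one made partly redundant on purpose; none of this is the Prime data) that standardizes, fits PCA, and counts how many components are needed to reach a 90% threshold. The threshold is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.5 * X[:, 2]  # make one column partly redundant

# Standardize first, then fit PCA with all components kept.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches 90%.
n_components_90 = int(np.searchsorted(cumulative, 0.90) + 1)

print("ratios:    ", ratios.round(4))
print("cumulative:", cumulative.round(4))
print("components for 90%:", n_components_90)
```

In the notebook the same two lines, `np.cumsum(explained_variance)` and the `searchsorted` call, can be applied to the `explained_variance` array.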

In [None]:
df1= pd.merge(data, df, right_index=True, left_index=True)   ### Merge the original data with the imputed numeric columns on the row index.
df1

In [None]:
top_drama = df1[df1['Genre'] == 'Drama'].sort_values(by = 'IMDb ratingss',ascending = False)   
#Top 10 TV shows in drama which has the highest rating
top_drama.head(10)

### Statistical Tests

In [None]:
crosstab1=pd.crosstab(df1['No of seasons available_y'], df1['IMDb ratingss'])
crosstab1

In [None]:
pd.crosstab(index=df1['Age of viewers'],columns=df1['Genre'],dropna=True)      ####Two Way Tables

In [None]:
crosstab12=pd.crosstab(df1['Age of viewers'], df1['Language'])
crosstab12

In [None]:
from scipy import stats
from scipy.stats import chi2_contingency
print("Chi Square Test")
print("\n")

print("Null Hypothesis : The two variables are independent (no association)")
print("Alternative Hypothesis : The two variables are associated (not independent)")
print("\n")

chi, pval, dof, exp = chi2_contingency(crosstab12)
alpha=0.05

print("Chi Square Test statistic : %.3f, p value : %.6f" % (chi, pval))

if pval > alpha:
    print('No evidence of association between the variables (fail to reject H0)')
else:
    print('Variables are associated (reject H0)') 

- This result indicates that the language of a show is associated with the age group of its viewers.
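
The chi-square test compares the observed counts in the two-way table with the counts expected under independence. The sketch below runs `chi2_contingency` on a hypothetical 3x3 table (the counts are made up) and adds Cramér's V as a rough effect size, since a small p-value alone does not say how strong the association is.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts of shows by age group (rows) and language (columns).
table = pd.DataFrame(
    [[30, 10, 5], [20, 25, 10], [5, 15, 30]],
    index=["All", "13+", "18+"],
    columns=["English", "Hindi", "Other"],
)

chi2, pval, dof, expected = chi2_contingency(table)

# Cramér's V: 0 means no association, 1 means perfect association.
n = table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# The chi-square approximation is usually considered reliable when
# expected counts are not too small (a common rule of thumb is >= 5).
min_expected = expected.min()

print(f"chi2={chi2:.3f}, p={pval:.6f}, dof={dof}")
print(f"Cramér's V={cramers_v:.3f}, smallest expected count={min_expected:.1f}")
```

On the notebook's table, `chi2_contingency(crosstab12)` already returns the expected counts as its fourth value, so the same checks apply directly.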

In [None]:
df1.head(1)

In [None]:
print("One way ANOVA")
print("\n")

print("Null Hypothesis : The means of all groups are equal")
print("Alternative Hypothesis : At least one group mean differs from the others")
print("\n")

col1 = list(df1.columns.values)[1]     #Year of release    
col2 = list(df1.columns.values)[7]     #Number of seasons available
col3 =  list(df1.columns.values)[8]    #IMDb ratings

test,p=stats.f_oneway(df1[col1], df1[col2],df1[col3])

print("One way ANOVA statistic : %.3f, p value : %.6f" % (test, p))

if p > alpha:
    print('There is no significant difference among the groups (fail to reject H0)')
else:
    print('There is a significant difference among the groups (reject H0)')

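- This result shows a significant difference among the groups, i.e. the means of year of release, number of seasons available, and rating differ. Because these three columns are measured on different scales, a difference in their means is expected; on its own it does not show that the variables are related to each other.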
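
One-way ANOVA is usually applied to one numeric variable split into groups by a categorical variable. The sketch below shows that setup on synthetic ratings for three genres (the group means and sizes are assumptions for illustration, not results from the Prime data).

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(4)

# Synthetic ratings for three genres with different true means.
toy = pd.DataFrame({
    "Genre": np.repeat(["Drama", "Comedy", "Kids"], 60),
    "rating": np.concatenate([
        rng.normal(7.8, 0.6, 60),
        rng.normal(7.2, 0.6, 60),
        rng.normal(6.5, 0.6, 60),
    ]),
})

# One array of ratings per genre, then a single F-test across the groups.
groups = [g["rating"].to_numpy() for _, g in toy.groupby("Genre")]
f_stat, p_value = stats.f_oneway(*groups)

print(f"F={f_stat:.3f}, p={p_value:.6f}")
```

With the notebook's merged frame, the analogous grouping would be `df1.groupby('Genre')['IMDb ratingss']`, which tests whether the mean rating differs between genres.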

## Thank You