# **Data Visualization, Recommendations, Sentiment Analysis**

*The amount of data generated each year is growing faster than ever. One widely cited forecast had the world's total data growing tenfold, from about 4.4 zettabytes to 44 zettabytes (44 trillion gigabytes), and by 2025 the total is projected to reach 175 zettabytes. This rapid expansion has led to a new age of data.*

*Our brain collects, understands, and responds to visual information in less than 250 milliseconds. Comparing several tables of raw data, on the other hand, demands an effort of abstraction and memory that is simply not achievable beyond a certain volume of data. Companies like Netflix, Twitter, and Amazon use data visualization to exploit their data. Raw data sets can also be ambiguous, because readers can draw their own conclusions. Data visualization mitigates this and makes data more accessible and shareable.*

*Data is also being used to build more efficient systems, and that is where recommendation systems come into the picture. A recommender system takes a user's preferences as input and suggests similar content that the user may also be interested in.*

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import datetime
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
import warnings
warnings.filterwarnings("ignore")

# Importing the Data to perform the Operations

In [None]:
data = pd.read_csv("/kaggle/input/netflix-shows/netflix_titles.csv")
print(data.shape)
data.head()

In [None]:
data.nunique() # Number of unique values in each column

In [None]:
data.isnull().sum() # Count of missing (NaN) values in each column

# Cleaning the Data

**From the above result, we can see that the director, cast, country, date_added, and rating columns have missing values. We handle those missing values first.**

In [None]:
# Assign the result back instead of using inplace=True on a column, which may not update the DataFrame in newer pandas versions
data["director"] = data["director"].fillna("No Director")
data["cast"] = data["cast"].fillna("No Cast")
data["country"] = data["country"].fillna("Country Unavailable")
# Drop the rows with NaN in "date_added" or "rating", because there are very few of them
data = data.dropna(subset=["date_added", "rating"])

In [None]:
data.isnull().sum()

# Data Visualization

# 1) Types of Content Present on Netflix

In [None]:
Visualization = px.pie(values=data['type'].value_counts(), 
             names=data['type'].value_counts().index,title='Type of Content on Netflix')

Visualization.show()

**From the above pie chart, we can clearly see that there is more Movie content than TV Show content.**

# 2) Displaying the content type based on the selected Country

**Some films and television shows list several countries. We take into account only the first country name that appears in the country column.**
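
As a minimal sketch of this first-country rule, the toy example below uses pandas string methods on made-up values. The `sample` series is hypothetical and not taken from the dataset, and stripping whitespace is an added assumption that helps if an entry carries stray spaces around the comma.

```python
import pandas as pd

# Hypothetical sample values; the real column comes from netflix_titles.csv
sample = pd.Series(["United States, India", "India", "France , Belgium"])

# Keep only the text before the first comma and trim surrounding spaces
first_country = sample.str.split(",").str[0].str.strip()
print(first_country.tolist())
```

Entries with a single country pass through unchanged, so the same expression can be applied to the whole column.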

In [None]:
data['country'] = [countries[0] for countries in data['country'].str.split(',')]

In [None]:
def visualise_country(country):
    if (country == ALL):
        data_vis = data
    
    else:
        data_vis = data[data.country == country]
        
    Visualization = px.pie(values=data_vis['type'].value_counts(), 
             names=data_vis['type'].value_counts().index, 
             title=f'Total number of TV-Shows and Movies from {country}.')
    Visualization.show()

**We use the "ipywidgets" library to pick the country we want to view from a dropdown menu. The function visualise_country shows the number of TV shows and movies for the selected country.**

In [None]:
import ipywidgets as widgets
from IPython.display import display

ALL = 'ALL'
def total_unique_country_names(array):
    unique = array.unique().tolist()
    unique.sort()
    unique.insert(0, ALL)
    return unique

dropdown_country = widgets.Dropdown(options = total_unique_country_names(data.country))
output_country = widgets.Output()

def dropdown_country_eventhandler(change):
    output_country.clear_output()
    with output_country:
        visualise_country(change.new)  # shows the figure itself and returns None, so no display() is needed
        
dropdown_country.observe(dropdown_country_eventhandler, names='value')
display(dropdown_country)

In [None]:
display(output_country)

# 3) Top 15 Countries producing the content to Netflix

In [None]:
data_country = data['country'].value_counts().sort_values(ascending=False)
top15countries = data_country.head(15)
top15countries

In [None]:
Visualization = px.pie(values=top15countries, 
                       names=top15countries.index,title='Top 15 Countries producing the content to Netflix')

Visualization.show()

# 4) Ratings classification on Netflix

In [None]:
Ratings = data['rating'].value_counts()
Ratings

In [None]:
Visualization = px.funnel(Ratings,title='Types of Rating on Netflix')

Visualization.show()

**From the above data, it is difficult to tell which ratings apply to Kids, Teens, or Adults, because several ratings relate to each group. For a better understanding, we group the ratings as follows and display the result.**

**1. Adults**
* R - Restricted. May be inappropriate for ages 17 and under.
* TV-MA - For Mature Audiences. May not be suitable for ages 17 and under.
* NC-17 - Inappropriate for ages 17 and under

**2. Teens**
* PG-13 - Parents strongly cautioned. May be inappropriate for ages 12 and under.
* TV-14 - Parents strongly cautioned. May not be suitable for ages 14 and under.

**3. Kids**

* TV-Y - Designed to be appropriate for all children
* TV-Y7 - Suitable for ages 7 and up
* G - Suitable for General Audiences
* TV-G - Suitable for General Audiences
* PG - Parental Guidance suggested
* TV-PG - Parental Guidance suggested

*Note: TV and movie ratings may vary by region. The above ratings are applicable only to the United States.*

In [None]:
def group_by_rating(rating):
    if rating in ['TV-Y', 'TV-Y7', 'TV-Y7-FV', 'G', 'TV-G', 'PG', 'TV-PG']:
        new_rating = 'Kids'
    elif rating in ['PG-13', 'TV-14']:
        new_rating = 'Teens'
    elif rating in ['R', 'NC-17', 'TV-MA']:
        new_rating = 'Adults'
    else:
        new_rating = 'Unrated'
    return new_rating
        

data['rating_group'] = data['rating'].apply(group_by_rating)

print(data.rating_group.value_counts())

order_rating = ['Kids', 'Teens', 'Adults', 'Unrated']

Visualization = px.bar(y = data['rating_group'].value_counts(), 
             x = data['rating_group'].value_counts().index,
             labels = dict(x="Rating", y="Total Number"),
             title = 'TV-Shows and Movies Rating in Netflix'
            )

Visualization.update_xaxes(categoryorder = 'array', categoryarray= order_rating)

Visualization.show()

# 5) Movies Wordcloud

In [None]:
Movie_Names = data[data['type'] == 'Movie'].title

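# Join the titles into one string, without the quotes and brackets that str(list) would add
text = " ".join(Movie_Names.astype(str))

plt.rcParams['figure.figsize'] = (15, 15)

wordcloud = WordCloud(max_words=1000000,background_color="White").generate(text)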

plt.imshow(wordcloud,interpolation="bilinear")
plt.axis("off")
plt.margins(x=3, y=1)
plt.show()

# 6) Finding Top rated Movies

**For finding Top rated movies, we are adding one more dataset "imdb-extensive-dataset". We will join this with Netflix data and display the top rated movies by matching the "Title" in both data sets.**

In [None]:
imdb_movie_names = pd.read_csv('../input/imdb-extensive-dataset/IMDb movies.csv',
                               usecols=['title', 'year', 'avg_vote'])

new_ratings = pd.DataFrame({'Title':imdb_movie_names.title,
                    'Rating': imdb_movie_names.avg_vote,
                           'Year' : imdb_movie_names.year})

new_ratings.drop_duplicates(subset=['Title','Year','Rating'], inplace=True)
print(new_ratings.shape)
new_ratings.head(5)

**We applied an inner join on the 'new_ratings' dataset and the Netflix dataset to get the titles that have a rating on IMDb and are also available on Netflix.**
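
Matching on title alone can pair a Netflix entry with a different film that happens to share its name. One way to reduce such false matches, sketched here on made-up rows, is to join on both title and year. The frames `ratings_demo` and `netflix_demo` are hypothetical, and the sketch assumes that the IMDb year and the Netflix `release_year` agree for the same film and share an integer dtype, which may not always hold in the real files.

```python
import pandas as pd

# Hypothetical IMDb-style ratings: two different films share the title "Film A"
ratings_demo = pd.DataFrame({
    "Title": ["Film A", "Film A", "Film B"],
    "Year": [1986, 1995, 2009],
    "Rating": [5.6, 8.2, 7.0],
})

# Hypothetical Netflix-style catalogue
netflix_demo = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "release_year": [1995, 2009],
})

# Title-only join: "Film A" matches both IMDb rows
by_title = ratings_demo.merge(
    netflix_demo, left_on="Title", right_on="title", how="inner"
)

# Title-and-year join: each catalogue entry matches at most one IMDb row here
by_title_year = ratings_demo.merge(
    netflix_demo,
    left_on=["Title", "Year"],
    right_on=["title", "release_year"],
    how="inner",
)

print(len(by_title), len(by_title_year))
```

The stricter join trades recall for precision: titles whose years differ between the two sources drop out of the result.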

In [None]:
Inner_join_data = new_ratings.merge(data,left_on='Title', right_on='title', how='inner')
Inner_join_data=Inner_join_data.sort_values(by='Rating', ascending=False)

In [None]:
top_rated=Inner_join_data[0:15]
fig =px.sunburst(
    top_rated,
    path=['title','country'],
    values='Rating',
    color='Rating')
fig.show()

In [None]:
countries_data = Inner_join_data['country'].value_counts().sort_values(ascending=False)
country_count = pd.DataFrame(countries_data)
Top_countries = country_count.head(15)

In [None]:
Visualization = px.bar(Top_countries, title = "Countries with highest rated content")
Visualization.show()

# Interactive Visualizations using iPlot, Seaborn

**Plotly provides a variety of APIs that range from low-level to high-level. Here we use iplot from plotly.offline, which takes a list of graph objects and renders the figure inline in the notebook.**

**These graphs are fully interactive. The toolbar on the top-right can be used to perform various operations on the plot, such as zooming and panning. A tooltip appears when we hover over a data point. The plot can also be saved as a PNG image.**

In [None]:
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go

iplot([go.Scatter(x=data['country'], y=data['director'], mode='markers')])

In [None]:
# iplot([go.Histogram2dContour(x=data.head(15)['country'], 
#                              y=data.head(15)['type'], 
#                              contours=go.Contours(coloring='heatmap')),
#        go.Scatter(x=data['country'].head(20), y=data['type'].head(20), mode='markers')])

In [None]:
# df = data.assign(n=0).groupby(['release_year', 'country'])['n'].count().reset_index()
# df = df[df['release_year'] < 2010 ]
# v = df.pivot(index='release_year', columns='country', values='n').fillna(0).values.tolist()

# iplot([go.Surface(z=v)])

In [None]:
# Visualization using Choropleth

df = data['country'].value_counts()

iplot([go.Choropleth(
    locationmode='country names',
    locations=df.index.values,
    text=df.index,
    z=df.values,
)])

In [None]:
sns.countplot(x=data['rating'])  # pass the column as x; newer seaborn versions no longer accept it positionally

**Unlike pandas plotting, seaborn does not require us to call value_counts first; countplot aggregates the data itself.**

In [None]:
sns.kdeplot(data.query('release_year > 2015').release_year)

**When it comes to showing the "true shape" of interval data, a KDE plot does better than a line chart.**

# Recommendations

**We use content-based recommendation. For this, we combine the information from four columns (description, cast, director, and listed_in, which holds the genres) and measure how similar the titles in the dataset are to one another.**

**Steps involved in finding the recommendation.**

* First, convert the combined text to matrix form using CountVectorizer.
* Next, compute the cosine similarity between all rows of that matrix.
* Then get the index of the movie for which the user wants recommendations.
* Take that movie's row of the cosine similarity matrix and convert it into a list of tuples, where the first element is a movie index and the second is its similarity score.
* Sort the tuples from the highest to the lowest similarity score.
* Finally, return the top 10 movies with the highest scores, skipping the movie itself.
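
The steps above can be sketched on a tiny example. The titles and feature strings below are hypothetical and only illustrate the flow; `recommend` is an illustrative helper, not a function from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical titles and combined-feature strings, for illustration only
titles = ["Space Saga", "Star Voyage", "Baking Show"]
texts = [
    "space adventure robots pilot",
    "space adventure aliens pilot",
    "baking cakes kitchen contest",
]

# Turn each string into a vector of word counts
count_matrix = CountVectorizer().fit_transform(texts)

# Pairwise cosine similarity between all rows (a 3 x 3 matrix)
similarity = cosine_similarity(count_matrix)

def recommend(title, top_n=2):
    """Return the top_n titles most similar to `title`, excluding itself."""
    idx = titles.index(title)
    # Pair every position with its score, dropping the query title by position
    scores = [(i, s) for i, s in enumerate(similarity[idx]) if i != idx]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [titles[i] for i, _ in scores[:top_n]]

print(recommend("Space Saga"))
```

Excluding the query title by position, rather than assuming it sorts first, keeps the result correct even when another entry has an identical feature string.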

In [None]:
# These two are useful for the recommendations

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

data.columns

In [None]:
features = ['description', 'cast', 'director', 'listed_in']

In [None]:
def combine_features(row):
    return row['description']+" "+row['cast']+" "+row['director']+" "+row['listed_in']

In [None]:
# Apply combine_features() to each row and store the combined string in the "combined_features" column
data["combined_features"] = data.apply(combine_features,axis=1)

In [None]:
data.iloc[0].combined_features

In [None]:
cv = CountVectorizer() #creating new CountVectorizer() object
count_matrix = cv.fit_transform(data["combined_features"]) #feeding combined strings(movie contents) to CountVectorizer() object

In [None]:
cosine_sim = cosine_similarity(count_matrix) # Pairwise cosine similarity between the rows of count_matrix, i.e. between all combined feature strings
cosine_sim

In [None]:
# Adding a positional "index" column (0..n-1) that lines up with the rows of cosine_sim

data["index"] = np.arange(len(data))

In [None]:
data.columns

data.tail(5)

In [None]:
# functions to get the title and index

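def get_title_from_index(index):
    # Use the positional "index" column; data.index holds the original row labels, which have gaps after dropna
    return data[data["index"] == index]["title"].values[0]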

def get_index_from_title(title):
    return data[data.title == title]["index"].values[0]

In [None]:
recommend_movie = "3 Idiots"
movie_index = get_index_from_title(recommend_movie)
similar_movies = list(enumerate(cosine_sim[movie_index]))

In [None]:
sorted_similar_movies = sorted(similar_movies,key=lambda x:x[1],reverse=True)[1:]

In [None]:
print("Top 10 similar movies to "+recommend_movie+" are:\n")
for element in sorted_similar_movies[:10]:
    print(get_title_from_index(element[0]))

# Sentiment Analysis on IMDB reviews

**Sentiment analysis is widely used to find customers' opinions in reviews, survey responses, websites, and social media. Since customers express their thoughts, feelings, and opinions more openly than ever before, sentiment analysis is becoming an essential tool to monitor and understand the sentiment in their reviews, comments, and feedback.**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
IMDB_Reviews = pd.read_csv('../input/imdb-extensive-dataset/IMDb movies.csv', low_memory=False)
IMDB_Reviews.head(3)

In [None]:
New_ratings = pd.DataFrame({'Title':IMDB_Reviews.title,
                    'Rating': IMDB_Reviews.avg_vote})

New_ratings.drop_duplicates(subset=['Title', 'Rating'], inplace=True)

print(New_ratings.shape)
New_ratings.head(5)

In [None]:
Inner_join_data = New_ratings.merge(data,left_on='Title', right_on='title', how='inner')
Inner_join_data=Inner_join_data.sort_values(by='Rating', ascending=False)

print(Inner_join_data.shape)
Inner_join_data.head(5)

In [None]:
# Select the columns and drop duplicates in one step, so no in-place change is made on a slice
New_Data = Inner_join_data[['Title', 'Rating', 'type']].drop_duplicates(subset=['Title','Rating', 'type'])
print(New_Data.shape)
New_Data.head(5)

In [None]:
Movies_Data = New_Data[New_Data.type == 'Movie']
TV_Data = New_Data[New_Data.type == 'TV Show']
print(Movies_Data.shape)
print(TV_Data.shape)

In [None]:
Movies_Data = Movies_Data.drop(['type'], axis=1)

Movies_Data

In [None]:
Movies_Data['Polarity_Rating'] = Movies_Data['Rating'].apply(lambda x: 'Positive' if x > 6 else 'Negative')
Movies_Data

In [None]:
fig = px.pie(values=Movies_Data['Polarity_Rating'].value_counts(), 
             names=Movies_Data['Polarity_Rating'].value_counts().index)
fig.show()

In [None]:
Positive = Movies_Data[Movies_Data['Polarity_Rating'] == 'Positive']
Negative = Movies_Data[Movies_Data['Polarity_Rating'] == 'Negative']

print(Positive.shape)
print(Negative.shape)

In [None]:
df = Movies_Data[['Title','Polarity_Rating']]
df

In [None]:
one_hot = pd.get_dummies(df["Polarity_Rating"], dtype=int)  # 0/1 columns "Negative" and "Positive"
df = pd.concat([df.drop(columns=['Polarity_Rating']), one_hot], axis=1)  # drop returns a copy, so the slice is not modified in place
df

In [None]:
X = df['Title'].values
y = df.drop('Title', axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

In [None]:
X_train

In [None]:
y_train

In [None]:
vect = CountVectorizer()
X_train = vect.fit_transform(X_train)
X_test = vect.transform(X_test)

tfidf = TfidfTransformer()
X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)
X_train = X_train.toarray()
X_test = X_test.toarray()

In [None]:
model = Sequential()

model.add(Dense(units=12673,activation='relu'))
model.add(Dropout(0.2))

model.add(Dense(units=4000,activation='relu'))
model.add(Dropout(0.2))

model.add(Dense(units=500,activation='relu'))
model.add(Dropout(0.2))

model.add(Dense(units=2, activation='sigmoid'))

opt=tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['binary_accuracy'])

early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=4)

In [None]:
xyz= model.fit(x=X_train, y=y_train, batch_size=50, epochs=80, validation_data=(X_test, y_test), verbose=1, 
               callbacks=[early_stop])  # callbacks is expected to be a list
xyz

In [None]:
model_score = model.evaluate(X_test, y_test, batch_size=64, verbose=1)
print('Test accuracy:', model_score[1])

In [None]:
a = pd.DataFrame(xyz.history)

a.loc[1:, ['loss', 'val_loss']].plot()
a.loc[1:, ['binary_accuracy', 'val_binary_accuracy']].plot()

print(("Best Validation Loss: {:0.4f}" +\
      "\nBest Validation Accuracy: {:0.4f}")\
      .format(a['val_loss'].min(), 
              a['val_binary_accuracy'].max()))

In [None]:
predict = model.predict(X_test)
predict