### We have the data for the 100 top-rated movies from the past decade along with various pieces of information about each movie, its actors, and the voters who have rated these movies online. In this notebook, we will try to find some interesting insights into these movies and their voters, using Python.

In [None]:
## Let's Filter Out the warnings first

import warnings
warnings.filterwarnings('ignore')

In [None]:
## Let's import the libraries needed to analyse and visualize our data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

We have already uploaded the dataset for the IMDb top 100 movies. Let's read the dataset first and then go ahead with the analysis.

In [None]:
## Read the csv file using 'read_csv'. Please write your dataset location here.

movies = pd.read_csv('/kaggle/input/imdb-movies-data-set/MovieAssignmentData.csv')

As of now, we have loaded our dataset in a variable called "movies". Let's examine the loaded dataset for basic knowledge about the columns, rows, index, null values etc.

In [None]:
movies.head()

In [None]:
## Check the no. of Rows & Columns

movies.shape

In [None]:
movies.info()

In [None]:
movies.describe()

#### As we can see, there are very few null values, in only 5 columns, and those columns do not contain any metric that could hinder our analysis. Therefore, let's move on to data manipulation, analysis, and visualisation to get various insights about the data.

The numbers in the `budget` and `Gross` columns are too big, which hurts readability. Let's convert the unit of these columns from `$` to `million $` first.

In [None]:
# Divide the 'gross' and 'budget' columns by 1000000 to convert '$' to 'million $'

movies.budget = movies.budget / 1000000
movies.Gross = movies.Gross / 1000000

In [None]:
## Examine the data for changes

movies.head()

In [None]:
## Let's count the null values for all the columns present in the data. The first line raises the display limit
## so that every column is shown in the output.


pd.set_option('display.max_rows',100)
movies.isnull().sum()
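
Raw null counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a made-up frame (the `toy` values are illustrative and not taken from the dataset) that reports both:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "budget": [10.0, np.nan, 30.0, 40.0],
    "Gross": [15.0, 25.0, np.nan, np.nan],
})

null_counts = toy.isnull().sum()         # number of missing values per column
null_share = toy.isnull().mean() * 100   # percentage of missing values per column

summary = pd.DataFrame({"nulls": null_counts, "percent": null_share})
print(summary)
```

The same two lines should work on `movies` unchanged, since they only rely on `isnull()`.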

1. Create a new column called `profit` which contains the difference of the two columns: `gross` and `budget`.
2. Sort the dataframe using the `profit` column as reference.
3. Extract the top ten profiting movies in descending order and store them in a new dataframe - `top10`.
4. Plot a scatter or a joint plot between the columns `budget` and `profit` and write a few words on what you observed.
5. Extract the movies with a negative profit and store them in a new dataframe - `neg_profit`
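
Steps 1, 2, 3, and 5 above can be sketched on a tiny made-up frame before running them on the real data. The titles and numbers below are hypothetical, and the sketch takes the top 2 rather than the top 10 because the toy frame only has four rows:

```python
import pandas as pd

# Toy data in million $ (illustrative values, not from the dataset)
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "budget": [100.0, 20.0, 150.0, 60.0],
    "Gross": [300.0, 90.0, 120.0, 60.0],
})

toy["profit"] = toy["Gross"] - toy["budget"]           # step 1: profit column
toy = toy.sort_values(by="profit", ascending=False)    # step 2: sort by profit
toy = toy.reset_index(drop=True)
top2 = toy.head(2)                                     # step 3: most profitable rows
neg_profit = toy[toy["profit"] < 0]                    # step 5: loss-making rows

print(top2[["Title", "profit"]])
print(neg_profit[["Title", "profit"]])
```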

In [None]:
## Lets Create a new column "profit" = Gross - budget

movies['profit'] = movies.Gross - movies.budget

movies.head()

In [None]:
# Sort the dataframe with the 'profit' column as reference using the 'sort_values' function. 
# Make sure to set the argument 'ascending' to 'False'

movies.sort_values(by='profit', ascending=False, inplace=True)
movies.reset_index(drop=True, inplace=True)

In [None]:
movies.head()

In [None]:
# Get the top 10 profitable movies by using position based indexing. Specify the rows till 10 (0-9)

top10 = movies.iloc[0:10]
top10

In [None]:
#Plot profit vs budget

plt.figure(figsize=[8,10])
plt.scatter(movies.budget, movies.profit)

plt.title('Relationship b/w Profit and Budget', fontdict= {'fontsize':30, 'fontweight':5, 'color':'Green'})

plt.xlabel('Budget',fontdict= {'fontsize':15, 'fontweight':5, 'color':'Brown'} )
plt.ylabel('Profit', fontdict= {'fontsize':15, 'fontweight':5, 'color':'Brown'})
plt.show()

The scatter plot shows the relationship between profit and budget. At the low-budget end of the plot, profit tends to rise as the budget increases. But as we move from left to right, the profit of some movies turns negative as the budget increases. Hence, we can conclude that there is no clear linear relationship between these two variables.
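
One way to back up a visual impression like this with a number is the Pearson correlation coefficient, where a value near 0 points to a weak linear relationship. The sketch below uses made-up budget and profit values, so the coefficient it prints says nothing about the real dataset:

```python
import pandas as pd

# Hypothetical budget/profit pairs in million $ (not from the dataset)
toy = pd.DataFrame({
    "budget": [10.0, 50.0, 100.0, 150.0, 200.0, 250.0],
    "profit": [20.0, 120.0, 200.0, -40.0, 150.0, -60.0],
})

# Series.corr computes the Pearson correlation by default
r = toy["budget"].corr(toy["profit"])
print(f"Pearson r = {r:.2f}")
```

On the notebook's data the equivalent call would be `movies.budget.corr(movies.profit)`.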

The dataset contains the 100 best performing movies from the year 2010 to 2016. However, the scatter plot tells a different story: there are some movies with negative profit. Although good movies do sometimes incur losses, there appear to be quite a few movies with losses here. What can be the reason behind this? Let's take a closer look by finding the movies with negative profit.

In [None]:
#Lets Find the movies with negative profit

neg_profit = movies[movies['profit']<0].sort_values(by='profit', ascending=True)

neg_profit

We can spot the movie 'Tangled' in this list. Although it's one of the highest grossing movies of all time, it has negative profit as per this result. If we cross-check the gross value of this movie (link: https://www.imdb.com/title/tt0398286/), we can see that the gross in the dataset accounts only for the domestic gross and not the worldwide gross. This is true for many other movies in the list as well.

You might have noticed the column MetaCritic in this dataset. Metacritic is a very popular website where an average score is determined through the scores given by the top-rated critics. You also have another column, IMDb_rating, which tells you the IMDb rating of a movie. This rating is determined by taking the average of hundreds of thousands of ratings from the general audience.

As a part of this subtask, you are required to find out the highest rated movies which have been liked by critics and audiences alike.

Firstly you will notice that the MetaCritic score is on a scale of 100 whereas the IMDb_rating is on a scale of 10. First convert the MetaCritic column to a scale of 10.
Now, to find out the movies which have been liked by both critics and audiences alike and also have a high rating overall, you need to -

1. Create a new column Avg_rating which will have the average of the MetaCritic and IMDb_rating columns
2. Retain only the movies in which the absolute difference (using the abs() function) between the IMDb_rating and MetaCritic columns is less than 0.5. Refer to this link to know how the abs() function works - https://www.geeksforgeeks.org/abs-in-python/ .
3. Sort these values in a descending order of Avg_rating and retain only the movies with a rating equal to or greater than 8 and store these movies in a new dataframe UniversalAcclaim.
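
The three steps above, plus the rescaling of MetaCritic, can be tried out on a few made-up rows first. The titles and scores below are hypothetical; only the column names follow the dataset:

```python
import pandas as pd

# Toy ratings (illustrative values, not from the dataset)
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "MetaCritic": [90, 95, 60, 78, 82],           # scale of 100
    "IMDb_rating": [8.8, 8.1, 8.4, 7.9, 8.0],     # scale of 10
})

toy["MetaCritic"] = toy["MetaCritic"] / 10        # bring both scores to a scale of 10
toy["Avg_rating"] = toy[["MetaCritic", "IMDb_rating"]].mean(axis=1)

close = (toy["IMDb_rating"] - toy["MetaCritic"]).abs() < 0.5   # critics and audience agree
high = toy["Avg_rating"] >= 8                                  # and the movie is rated highly

acclaimed = (
    toy[close & high]
    .sort_values(by="Avg_rating", ascending=False)
    .reset_index(drop=True)
)
print(acclaimed[["Title", "Avg_rating"]])
```

Here B and C drop out because critics and audience disagree by more than 0.5, and D drops out because its average is below 8.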

In [None]:
movies.columns

In [None]:
movies.MetaCritic.head()


In [None]:
movies.IMDb_rating.head()

In [None]:
movies.MetaCritic = movies.MetaCritic/10
movies.MetaCritic.head()

In [None]:
## Let's Find the Avg_rating of the movies by taking the mean of "MetaCritic" & "IMDb_rating"
## The Avg_rating will be stored in a new Column of the dataframe (movies)

movies['Avg_Rating'] = movies.loc[:, ['MetaCritic', 'IMDb_rating']].mean(axis=1)

movies.head()

In [None]:
## Let's analyze the Avg_Rating column values now.

movies.Avg_Rating.describe()

"Avg_Rating" column's numerical analysis clearly states :
1. All the values lie between 6.9 and 8.95.
2. 75% of the values are at or below 8.10 (the 75th percentile).
3. The maximum average rating for a film is 8.95.


In [None]:
## Our current DataFrame is sorted based on the "Profit" earlier to find the Neg-Profit movies.

## Let's sort the DataFrame by "Avg_rating" descending now.

movies.sort_values(by="Avg_Rating", ascending=False, inplace=True)
movies.reset_index(drop=True, inplace=True)

movies.head()

In [None]:
## Find the movies where the absolute difference between the IMDb and MetaCritic ratings is < 0.5
## and the average rating is >= 8 (sorted in descending order)

UniversalAcclaim = movies[(abs(movies['IMDb_rating']-movies['MetaCritic'])<0.5) & (movies['Avg_Rating']>=8)]

# Chain the sort and the index reset instead of using inplace=True on a filtered slice,
# which can trigger a SettingWithCopyWarning
UniversalAcclaim = UniversalAcclaim.sort_values(by='Avg_Rating', ascending=False).reset_index(drop=True)

UniversalAcclaim.head()

In [None]:
## Let's have look at our DataFrame now.

movies.head()

### Find the Most Popular Trios - I
You're a producer looking to make a blockbuster movie. There will primarily be three lead roles in your movie and you wish to cast the most popular actors for it. Now, since you don't want to take a risk, you will cast a trio which has already acted together in a movie before. The metric that you've chosen to check the popularity is the Facebook likes of each of these actors.

The dataframe has three columns to help you out for the same, viz. actor_1_facebook_likes, actor_2_facebook_likes, and actor_3_facebook_likes. Your objective is to find the trios which have the highest number of Facebook likes combined. That is, the sum of actor_1_facebook_likes, actor_2_facebook_likes and actor_3_facebook_likes should be maximum. Find out the top 5 popular trios, and output their names in a list.
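
A minimal sketch of this ranking on made-up rows is shown below. The actor names and like counts are hypothetical, and it picks the top 2 trios rather than the top 5 because the toy frame only has three rows:

```python
import pandas as pd

# Toy cast data (names and like counts are made up)
toy = pd.DataFrame({
    "actor_1_name": ["P", "S", "V"],
    "actor_2_name": ["Q", "T", "W"],
    "actor_3_name": ["R", "U", "X"],
    "actor_1_facebook_likes": [1000, 30000, 500],
    "actor_2_facebook_likes": [2000, 20000, 400],
    "actor_3_facebook_likes": [3000, 10000, 300],
})

like_cols = ["actor_1_facebook_likes", "actor_2_facebook_likes", "actor_3_facebook_likes"]
name_cols = ["actor_1_name", "actor_2_name", "actor_3_name"]

toy["TotalLikes"] = toy[like_cols].sum(axis=1)     # combined likes of each trio
top_trios = toy.nlargest(2, "TotalLikes")          # rows with the highest totals
trio_list = top_trios[name_cols].values.tolist()   # names as a list of lists

print(trio_list)  # [['S', 'T', 'U'], ['P', 'Q', 'R']]
```

Note that `sum(axis=1)` skips missing values by default, whereas adding the columns with `+` yields NaN whenever one actor's likes are missing, so the two approaches can rank rows with gaps differently.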

In [None]:
### Let's add a new column with name "TotalLikes"

movies['TotalLikes'] = movies.actor_1_facebook_likes + movies.actor_2_facebook_likes + movies.actor_3_facebook_likes

## Now lets sort the data as per "TotalLikes"

movies.sort_values(by="TotalLikes", ascending=False, inplace=True)
movies.reset_index(drop=True, inplace=True)

movies.head()

In [None]:
### Let's put Output for top 5 trios in a list

top_5_triolist = movies.head(5)[['actor_1_name','actor_2_name','actor_3_name']].values.tolist()

top_5_triolist

### Find the Most Popular Trios - II
In the previous subtask you found the popular trio based on the total number of Facebook likes. Let's add a small condition to it and make sure that all three actors are popular. The condition is that none of the three actors' Facebook likes should be less than half of the likes of either of the other two. For example, the following is a valid combo:

actor_1_facebook_likes: 70000
actor_2_facebook_likes: 40000
actor_3_facebook_likes: 50000
But the below one is not:

actor_1_facebook_likes: 70000
actor_2_facebook_likes: 40000
actor_3_facebook_likes: 30000
since in this case, actor_3_facebook_likes is 30000, which is less than half of actor_1_facebook_likes.

Having this condition ensures that you aren't getting any unpopular actor in your trio (since the total likes calculated in the previous question doesn't tell anything about the individual popularities of each actor in the trio.).

You can do a manual inspection of the top 5 popular trios you have found in the previous subtask and check how many of those trios satisfy this condition. Also, which is the most popular trio after applying the condition above?
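
The condition can also be checked in one line: no actor has fewer than half the likes of either co-star exactly when the least-liked actor has at least half the likes of the most-liked one. The sketch below applies this equivalent form to the two example combos given above:

```python
import pandas as pd

# The two example combos from the text: the first is valid, the second is not
likes = pd.DataFrame({
    "actor_1_facebook_likes": [70000, 70000],
    "actor_2_facebook_likes": [40000, 40000],
    "actor_3_facebook_likes": [50000, 30000],
})

# If the minimum is at least half the maximum, every pairwise comparison holds too
eligible = likes.min(axis=1) >= likes.max(axis=1) / 2

print(eligible.tolist())  # [True, False]
```

This row-wise form avoids writing out all six pairwise comparisons by hand, which makes it harder to miss one.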


In [None]:
# Half of each actor's likes, used as the threshold for the other two actors
act1= movies['actor_1_facebook_likes']/2
act2= movies['actor_2_facebook_likes']/2
act3= movies['actor_3_facebook_likes']/2

# Each actor's likes must not be less than half of each of the other two actors' likes
a=((movies['actor_1_facebook_likes'] >= act2) & (movies['actor_1_facebook_likes'] >= act3))
b=((movies['actor_2_facebook_likes'] >= act1) & (movies['actor_2_facebook_likes'] >= act3))
c=((movies['actor_3_facebook_likes'] >= act1) & (movies['actor_3_facebook_likes'] >= act2))
eligible=a & b & c

In [None]:
## Lets add the eligible column in DataFrame - movies

movies['eligible']=eligible

movies.loc[eligible, ['eligible', 'actor_1_name','actor_2_name','actor_3_name']]

### Runtime Analysis
There is a column named Runtime in the dataframe which primarily shows the length of the movie. It might be interesting to see how this variable is distributed. Plot a histogram or distplot of seaborn to find the Runtime range most of the movies fall into.
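
Alongside the plot, the busiest runtime range can be read off numerically with `np.histogram`. This is a sketch on made-up runtimes (the values and the choice of 5 bins are assumptions for illustration):

```python
import numpy as np

# Hypothetical runtimes in minutes (not from the dataset)
runtimes = np.array([95, 102, 110, 115, 118, 120, 121, 125, 130, 148, 165, 180])

# counts[i] is the number of runtimes between edges[i] and edges[i + 1]
counts, edges = np.histogram(runtimes, bins=5)

busiest = counts.argmax()   # index of the bin with the most movies
print(f"Most movies fall between {edges[busiest]:.0f} and {edges[busiest + 1]:.0f} minutes")
```

On the notebook's data, `movies.Runtime.dropna()` would take the place of `runtimes`.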

In [None]:
## Let's plot a histogram of Runtime against the count of movies to find the most common length.

plt.figure(figsize=[8,6])
sns.set_style("whitegrid")

Runtime_plot = sns.histplot(data=movies, x="Runtime",  bins=10, color="Red", stat ="count")
 
Runtime_plot.set_title("Distribution of Movies RunTime", fontdict = {"fontsize":20, 'fontweight':10, 'color':'Green'})
Runtime_plot.set_xlabel('Runtime of the Movie', fontdict = {"fontsize":15, 'fontweight':5, 'color':'Brown'})
Runtime_plot.set_ylabel('Count of Movies', fontdict = {"fontsize":15, 'fontweight':5, 'color':'Brown'})

plt.show()

#### We can see that the largest number of movies have a runtime of around 2 hours. This suggests that, in terms of runtime, a movie of about 2 hours is a safe choice.

## R-Rated Movies
Although R rated movies are restricted for the under-18 age group, there are still vote counts from that age group. Among all the R rated movies that have been voted on by the under-18 age group, find the top 10 movies that have the highest number of votes, i.e., CVotesU18, from the movies dataframe. Store these in a dataframe named PopularR.

In [None]:
movies.columns

In [None]:
movies.content_rating.head(10)

In [None]:
# Keep only the top 10 R-rated movies by under-18 vote count, as the task asks
PopularR = movies[movies['content_rating']=='R'].sort_values(by='CVotesU18', ascending=False).head(10)

PopularR = PopularR.reset_index(drop=True)

PopularR[['Title', 'CVotesU18']]


#### Viewers under 18 are clearly watching "Deadpool" a lot: this movie has 4598 votes from the under-18 group.

## Task 3 : Demographic analysis
If you take a look at the last columns in the dataframe, most of these are related to demographics of the voters (in the last subtask, i.e., 2.8, you made use of one of these columns - CVotesU18). We also have three genre columns indicating the genres of a particular movie. We will extensively use these columns for the third and the final stage of our assignment wherein we will analyse the voters across all demographics and also see how these vary across various genres. So without further ado, let's get started with demographic analysis.

### Subtask 3.1 Combine the Dataframe by Genres
There are 3 columns in the dataframe - genre_1, genre_2, and genre_3. As a part of this subtask, you need to aggregate a few values over these 3 columns.

1. First create a new dataframe df_by_genre that contains genre_1, genre_2, and genre_3 and all the columns related to CVotes/Votes from the movies data frame. There are 47 columns to be extracted in total.
2. Now, Add a column called cnt to the dataframe df_by_genre and initialize it to one. You will realise the use of this column by the end of this subtask.
3. First group the dataframe df_by_genre by genre_1 and find the sum of all the numeric columns such as cnt, columns related to CVotes and Votes columns and store it in a dataframe df_by_g1.
4. Perform the same operation for genre_2 and genre_3 and store it dataframes df_by_g2 and df_by_g3 respectively.
5. Now that you have 3 dataframes obtained by grouping over genre_1, genre_2, and genre_3 separately, it's time to combine them. For this, add the three dataframes and store the result in a new dataframe df_add, so that the corresponding values of Votes/CVotes get added for each genre. There is a function called add() in pandas which lets you do this. You can refer to this link to see how this function works. https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.add.html
6. The column cnt on aggregation has basically kept track of the number of occurrences of each genre. Subset the genres that have at least 10 movies into a new dataframe genre_top10 based on the cnt column value.
7. Now, take the mean of all the numeric columns by dividing them by the column value cnt and store it back to the same dataframe. We will be using this dataframe for further analysis in this task unless it is explicitly mentioned to use the dataframe movies.
8. Since the number of votes can't be a fraction, type cast all the CVotes related columns to integers. Also, round off all the Votes related columns up to two digits after the decimal point.
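
The group, add, and divide steps above are easier to follow on a tiny example. The sketch below uses three made-up movies and a single hypothetical `CVotes` column standing in for the many Votes/CVotes columns of the real dataframe:

```python
import pandas as pd

# Toy data: three movies, one stand-in vote column (values are made up)
toy = pd.DataFrame({
    "genre_1": ["Action", "Drama", "Action"],
    "genre_2": ["Drama", "Romance", "Sci-Fi"],
    "genre_3": ["Sci-Fi", None, "Drama"],   # None: the movie has no third genre
    "CVotes": [100, 40, 60],
})
toy["cnt"] = 1   # each movie counts once towards every genre it belongs to

# Sum votes and counts per genre, separately for each genre column
parts = [toy.groupby(g)[["CVotes", "cnt"]].sum() for g in ["genre_1", "genre_2", "genre_3"]]

# fill_value=0 treats a genre missing from one frame as 0 instead of producing NaN
df_add = parts[0].add(parts[1], fill_value=0).add(parts[2], fill_value=0)

# Dividing the summed votes by the genre count gives the mean votes per movie
df_add["CVotes"] = df_add["CVotes"] / df_add["cnt"]
print(df_add)
```

Drama appears once in each genre column, so its `cnt` ends up as 3, which is exactly the bookkeeping the cnt column is there for.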

In [None]:
movies.head()

In [None]:
movies.columns

In [None]:
movies.shape

In [None]:
## Let's create a new DataFrame "df_by_genre" with the columns from 'genre_1' through all the CVotes/Votes columns.

## After examining the columns, we can see that we need to extract columns 12-60 and drop the columns 'MetaCritic' and 'Runtime'

df_by_genre = movies.iloc[:,11:60]
df_by_genre = df_by_genre.drop(columns=['MetaCritic', 'Runtime'])
df_by_genre.shape

In [None]:
df_by_genre.head()

In [None]:
### Let's add a new column 'cnt' and initialize it to 1

df_by_genre['cnt']=1

In [None]:
df_by_genre.head()

In [None]:
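### Let's group the movies by each of the three genre columns
### numeric_only=True keeps the remaining (text) genre columns out of the sums on recent pandas versions

df_by_genre_1 = df_by_genre.groupby(by='genre_1').sum(numeric_only=True)
df_by_genre_2 = df_by_genre.groupby(by='genre_2').sum(numeric_only=True)
df_by_genre_3 = df_by_genre.groupby(by='genre_3').sum(numeric_only=True)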

In [None]:
## Add the grouped DataFrames and store the result in a new DataFrame 'df_add'

df_add = df_by_genre_1.add(df_by_genre_2, fill_value=0)
df_add = df_add.add(df_by_genre_3, fill_value=0)

df_add

#### We can see that the "cnt" column now holds the number of times each genre appears in the dataset, since the ones were summed up when we grouped the data by the individual genre columns.

In [None]:
### Let's extract the top genres, i.e., the genres with at least 10 occurrences
### (.copy() makes this an independent DataFrame that is safe to modify later)

genre_top_10 = df_add[df_add['cnt']>=10].copy()

genre_top_10

In [None]:
### Let's take the mean for every column by dividing by the 'cnt' column, since the mean gives the most consolidated statistic for each popular genre.

genre_top_10.iloc[:, 0:44] = genre_top_10.iloc[:, 0:44].divide(genre_top_10.cnt, axis=0)

genre_top_10.head()

In [None]:
# Rounding off all the columns to two decimals (this covers the Votes columns)

genre_top_10 = genre_top_10.round(2)

genre_top_10.head()

In [None]:
### Now, let's convert all the CVotes columns to integer type, since vote counts cannot be fractions

CVotes = [col for col in genre_top_10.columns if col.startswith('CVotes')]

genre_top_10[CVotes] = genre_top_10[CVotes].astype('int32')

genre_top_10.head()

#### If you take a look at the final dataframe that you have gotten, you will see that you now have the complete information about all the demographic (Votes- and CVotes-related) columns across the top 10 genres. We can use this dataset to extract exciting insights about the voters!

### Subtask 3.2: Genre Counts!

Now let's derive some insights from this data frame. Make a bar chart plotting different genres vs cnt using seaborn.

In [None]:
## Let's move the genre from the index into a column and name that column "Genre"

genre_top_10 = genre_top_10.reset_index()
genre_top_10 = genre_top_10.rename(columns={"index":"Genre"})

genre_top_10.head()

In [None]:
## Now, let's plot a bar plot for Genre vs count

plt.figure(figsize=[10,8])
sns.set_style("whitegrid")

genre_plot = sns.barplot(y=genre_top_10.cnt, color='Red', x= genre_top_10.Genre)
 
genre_plot.set_title("Movies Genre VS Count", fontdict = {"fontsize":20, 'fontweight':10, 'color':'Green'})
genre_plot.set_xlabel('Genre of the Movie', fontdict = {"fontsize":15, 'fontweight':5, 'color':'Brown'})
genre_plot.set_ylabel('Count of Movies', fontdict = {"fontsize":15, 'fontweight':5, 'color':'Brown'})

plt.show()

#### The bar plot shows that "Drama" is the most common genre among these movies.

### Subtask 3.3: Gender and Genre

If you have closely looked at the Votes- and CVotes-related columns, you might have noticed the suffixes F and M indicating Female and Male. Since we have the vote counts for both males and females, across various age groups, let's now see how the popularity of genres vary between the two genders in the dataframe.

1. Make the first heatmap to see how the average number of votes of males is varying across the genres. Use seaborn heatmap for this analysis. The X-axis should contain the four age-groups for males, i.e., CVotesU18M,CVotes1829M, CVotes3044M, and CVotes45AM. The Y-axis will have the genres and the annotation in the heatmap tell the average number of votes for that age-male group.

2. Make the second heatmap to see how the average number of votes of females is varying across the genres. Use seaborn heatmap for this analysis. The X-axis should contain the four age-groups for females, i.e., CVotesU18F,CVotes1829F, CVotes3044F, and CVotes45AF. The Y-axis will have the genres and the annotation in the heatmap tell the average number of votes for that age-female group.

3. Make sure that you plot these heatmaps side by side using subplots so that you can easily compare the two genders and derive insights.

4. Write any three of your inferences from this plot. You can make use of the previous bar plot also here for better insights. Refer to this link- https://seaborn.pydata.org/generated/seaborn.heatmap.html. You might have to plot something similar to the fifth chart in this page (You have to plot two such heatmaps side by side).

5. Repeat subtasks 1 to 4, but now instead of taking the CVotes-related columns, you need to do the same process for the Votes-related columns. These heatmaps will show you how the two genders have rated movies across various genres.

You might need the below link for formatting your heatmap. https://stackoverflow.com/questions/56942670/matplotlib-seaborn-first-and-last-row-cut-in-half-of-heatmap-plot

Note : Use genre_top10 dataframe for this subtask

In [None]:
## First, let's make Genre the index again; earlier we had moved it into a column.

genre_top_10=genre_top_10.set_index('Genre')
genre_top_10.head()

In [None]:
genre_top_10.columns

In [None]:
### Let's make new dataframes with the average CVotes of M & F

# Creating a pivot_table for the heat map of the average number of votes of males across the genres
CVotes_M = pd.pivot_table(data=genre_top_10, index='Genre', values=["CVotesU18M","CVotes1829M","CVotes3044M","CVotes45AM"])

# Creating a pivot_table for the heat map of the average number of votes of females across the genres
CVotes_F = pd.pivot_table(data=genre_top_10, index='Genre', values=["CVotesU18F","CVotes1829F","CVotes3044F","CVotes45AF"])

In [None]:
CVotes_M

In [None]:
CVotes_F

In [None]:
# 1st set of heat maps for CVotes-related columns

fig, ax =plt.subplots(1,2,figsize=[15,12])
sns.heatmap(CVotes_M,cmap = "Greens", fmt="d", annot=True, ax=ax[0])
sns.heatmap(CVotes_F,cmap = "Greens", fmt="d", annot=True, ax=ax[1])
ax[0].set_title('Heatmap for CVotes - Males')
ax[1].set_title('Heatmap for CVotes - Females')
plt.show()

### Inferences: A few inferences that can be drawn from the heatmap above are that males have voted more than females, and Sci-Fi appears to be most popular among the 18-29 age group irrespective of their gender. What more can you infer from the two heatmaps that you have plotted? Write your three inferences/observations below:

##### Inference 1: 'Sci-Fi' genre is most popular among the 18-29 and 30-44 age group irrespective of their gender.
##### Inference 2: 'Action' genre is the second most popular among the 18-29 group for males and 'Adventure' genre is the second most popular among the 18-29 group for females
##### Inference 3: Age group under 18, irrespective of their gender, has the lowest number of votes received compared to the other age groups.

In [None]:
# Creating a pivot_table for the heat map of the average ratings given by males across the genres

Votes_M = pd.pivot_table(data=genre_top_10, index='Genre', values=["VotesU18M","Votes1829M","Votes3044M","Votes45AM"])

# Creating a pivot_table for the heat map of the average ratings given by females across the genres

Votes_F = pd.pivot_table(data=genre_top_10, index='Genre', values=["VotesU18F","Votes1829F","Votes3044F","Votes45AF"])

In [None]:
Votes_M

In [None]:
Votes_F

In [None]:
# 2nd set of heat maps for Votes-related columns

fig, ax =plt.subplots(1,2,figsize=[15,12])
sns.heatmap(Votes_M,cmap = "Greens", fmt=".2f", annot=True, ax=ax[0])
sns.heatmap(Votes_F,cmap = "Greens", fmt=".2f", annot=True, ax=ax[1])
ax[0].set_title('Heatmap for Votes - Males') # Setting the title of the heatmap for males
ax[1].set_title('Heatmap for Votes - Females') # Setting the title of the heatmap for females
plt.show()

### Inferences: Sci-Fi appears to be the highest rated genre in the age group of U18 for both males and females. Also, females in this age group have rated it a bit higher than the males in the same age group. What more can you infer from the two heatmaps that you have plotted? Write your three inferences/observations below:

##### Inference 1: Ratings from the U18 group, both F and M, are higher in comparison to the other age groups for most genres. This could be due to two factors: either the young are less critical than older voters, or the U18 age group is in a hurry to rate and gives most movies a similar rating.
##### Inference 2: The Crime genre has got a rating of 8.3 from U18, i.e., almost the same rating as the Sci-Fi genre. This could be a concern, and movie censor boards should look at appropriate content ratings.
##### Inference 3: Ratings by 45A, both M and F, are the lowest amongst all age groups. Being more critical, as mentioned in Inference 1, could be the reason.

### Subtask 3.4: US vs non-US Cross Analysis
The dataset contains both the US and non-US movies. Let's analyse how both the US and the non-US voters have responded to the US and the non-US movies.

1. Create a column IFUS in the dataframe movies. The column IFUS should contain the value "USA" if the Country of the movie is "USA". For all other countries other than the USA, IFUS should contain the value non-USA.

2. Now make a boxplot that shows how the number of votes from the US people, i.e., CVotesUS, is varying for the US and non-US movies. Make use of the column IFUS to make this plot. Similarly, make another subplot that shows how non-US voters have voted for the US and non-US movies by plotting CVotesnUS for both the US and non-US movies. Write any two of your inferences/observations from these plots.

3. Again do a similar analysis but with the ratings. Make a boxplot that shows how the ratings from the US people, i.e., VotesUS, are varying for the US and non-US movies. Similarly, make another subplot that shows how VotesnUS is varying for the US and non-US movies. Write any two of your inferences/observations from these plots.

Note : Use movies dataframe for this subtask. Make use of this documentation to format your boxplot - https://seaborn.pydata.org/generated/seaborn.boxplot.html
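
The medians that the boxplots display can also be computed directly, which helps when reading exact values off a plot is hard. The sketch below builds the IFUS flag with `np.where` (one possible alternative to a row-wise `apply`) on made-up rows; the countries and vote counts are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy data (countries and vote counts are made up)
toy = pd.DataFrame({
    "Country": ["USA", "UK", "USA", "India", "USA"],
    "CVotesUS": [50000, 20000, 70000, 10000, 60000],
})

# Vectorised flag: "USA" where the country is USA, "non-USA" everywhere else
toy["IFUS"] = np.where(toy["Country"] == "USA", "USA", "non-USA")

# The median per group is the centre line each box in the boxplot would show
medians = toy.groupby("IFUS")["CVotesUS"].median()
print(medians)
```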

In [None]:
## Let's analyze our original DataFrame again

movies.columns

In [None]:
## We have to see the Country Column for the next task.

movies.Country.value_counts()

In [None]:
movies.Country.value_counts(normalize=True)

In [None]:
## Lets create a new column "IFUS" to distinguish the Country of the movies(USA & non-USA)

movies['IFUS'] = movies['Country'].apply(lambda x : "USA" if x=='USA' else 'non-USA')

movies.head()

In [None]:
movies.columns

In [None]:
## Now we have a new Column IFUS - Lets analyse the No. of Votes from (US & non-US region) for USA & Non-USA movies.
## Votes for US & Non-US Voters are analyzed side by side.

## Box Plot-1 : CVotesUS & CVotesnUS with IFUS

fig, ax = plt.subplots(1,2,figsize=[10,8])
sns.boxplot(x="IFUS", y="CVotesUS", data=movies, ax=ax[0])
sns.boxplot(x="IFUS", y="CVotesnUS", data=movies, ax=ax[1])
ax[0].set_title('Votes by US Voters')    # Setting the Title for boxplot for CVotesUS
ax[1].set_title('Votes by non-US Voters')  # Setting the Title for boxplot for CVotesnUS
plt.show()

### **Inferences**: Write your two inferences/observations below:

**Inference 1**: In the first boxplot, for votes by US voters, the median is higher (approx. 50000) for USA movies and lower (in the range 45000-48000) for non-USA movies. We also observe a few outliers in the number of votes by US voters for USA movies, whereas there are no outliers for non-USA movies. Also, the upper hinge (75th percentile) and the lower hinge (25th percentile) are higher for USA movies than for non-USA movies.

**Inference 2**: In the second boxplot, for votes by non-US voters, the median is slightly higher for USA movies than for non-USA movies. The 75th percentile is approximately the same for both, so at the 75th percentile non-US voters cast almost the same number of votes for USA and non-USA movies.

In [None]:
## Lets now analyze the Ratings given by US & non-US voters to USA & non-USA movies.

## Box Plot - 2 : VotesUS(y) vs IFUS(x)

fig, ax = plt.subplots(1,2,figsize=[10,8])
sns.boxplot(x="IFUS", y="VotesUS", data=movies, ax=ax[0])
sns.boxplot(x="IFUS", y="VotesnUS", data=movies, ax=ax[1])
ax[0].set_title('Ratings from US Voters')    # Setting Title for BoxPlot for VotesUS
ax[1].set_title('Ratings from non-US Voters')  # Setting Title for BoxPlot for VotesnUS
plt.show()

### **Inferences**: Write your two inferences/observations below:

**Inference 1**: The first box plot shows that the median rating for USA movies (8.0) is higher than that for non-USA movies (7.9), which suggests that US voters rate US movies slightly higher than non-USA movies. The 75th percentile of the non-USA movie ratings is equal to the median of the USA movie ratings, and the 25th percentile is the same for both.

**Inference 2**: The second box plot shows that the median (7.8) for USA movies is higher than the median (~7.7) for non-USA movies. This shows that non-US voters also rate USA movies slightly higher than non-USA movies. Also, the lower extreme (fence) of the ratings is higher for non-USA movies than for USA movies.

## **Subtask 3.5: Top 1000 Voters Vs Genres**

You might have also observed the column CVotes1000. This column represents the top 1000 voters on IMDb and gives the count for the number of these voters who have voted for a particular movie. Let's see how these top 1000 voters have voted across the genres.

1. Sort the dataframe genre_top10 based on the value of CVotes1000 in descending order.

2. Make a seaborn barplot for genre vs CVotes1000.

3. Write your inferences. You can also try to relate it with the heatmaps you did in the previous subtasks.

In [None]:
## Lets analyze the CVotes1000 column

genre_top_10.CVotes1000

In [None]:
### Sort the genre_top_10 by CVotes1000

genre_top_10.sort_values(by="CVotes1000", ascending=False, inplace=True)

genre_top_10.CVotes1000

In [None]:
genre_top_10.head()

In [None]:
## Genre is the index here, so reset_index() turns it back into a "Genre" column for plotting
genre_top_10=genre_top_10.reset_index()

genre_top_10

In [None]:
### Lets plot the Bar Plot

plt.figure(figsize=[10,8])
sns.barplot(data=genre_top_10, x="Genre", y="CVotes1000")
plt.show()

### **Inferences**: 

1. Sci-Fi is the most popular genre among the top 1000 voters.
2. Sci-Fi appears to be the popular genre and highest rated genre in different age groups irrespective of the gender when related to the heatmaps plotted above. Hence, Sci-Fi genre can be considered for making new movies.
3. Sci-Fi, Action and Thriller are the top 3 genres among the top 1000 voters.
4. Drama, Animation and Romance are the bottom 3 genres which are unpopular among the top 1000 voters.

## Final Inferences on the Movies Data Set for insights into what movie to make so that it performs best.

1. Sci-Fi is the most popular genre among all age groups as well as the top 1000 voters.
2. Duration of the movie should be about 120 minutes, as analyzed earlier.
3. The top 3 trios of actors to be included are [Leonardo DiCaprio, Tom Hardy, Joseph Gordon-Levitt], [Jennifer Lawrence, Peter Dinklage, Hugh Jackman], [Christian Bale, Joseph Gordon-Levitt]


## !!! We can say that a Sci-Fi movie of about 2 hours in length, made with one of the above mentioned trios, is likely to work best in the USA as well as non-USA markets !!!