# Objective
Here, no specific problem will be answered. Instead, I'll explore the IMDb dataset and answer a few random questions:
<ul>
    <li>Which year was the best in terms of movies created?</li>
    <li>Which countries are the most represented in the top 10 highest-rated movies?</li>
    <li>Which countries create the most movies, all years combined?</li>
    <li>Which genre is the most represented among all movies?</li>
</ul>
I only focus on the years 2000 to 2020 and get rid of all the NaN values.
<br>
<br>
The goal is to practice with a big dataset using Python.

### Set up the environment

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
import seaborn as sns
import plotly.figure_factory as ff
from collections import Counter
from wordcloud import WordCloud

### Discover the dataset

The dataset has 85,855 rows and 22 columns.
<br>
The movies span 112 unique years, ranging from 1894 to 2020.

In [None]:
df = pd.read_csv('../input/imdb-extensive-dataset/IMDb movies.csv', low_memory = False)
df.head(2)

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isna().sum()

<b>imdb_title_id</b> is the primary key column, which means its values should be unique.
<br>
Ideally, titles and original titles should be unique as well.
<br>
Let's check whether the data is dirty.

In [None]:
df.duplicated(subset = 'imdb_title_id').sum() #check if there's duplicate

In [None]:
df.duplicated(subset = 'title').sum()

In [None]:
df.duplicated(subset = 'original_title').sum()

In [None]:
df.nunique()
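
To make the duplicate checks above concrete, here is a minimal sketch on a toy frame. The rows are hypothetical and not taken from the IMDb file. It shows that `duplicated()` flags every repeat of a value after its first occurrence.

```python
import pandas as pd

# Toy data (hypothetical rows, not taken from the IMDb file)
toy = pd.DataFrame({
    'imdb_title_id': ['tt01', 'tt02', 'tt03'],
    'title': ['Hamlet', 'Hamlet', 'Alien'],
})

# duplicated() flags every repeat of a value after its first occurrence
dup_ids = toy.duplicated(subset = 'imdb_title_id').sum()
dup_titles = toy.duplicated(subset = 'title').sum()

print(dup_ids, dup_titles)
```

Note that a duplicated title is not necessarily dirty data, since two different movies can share the same title.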

# Clean dataset

In order to successfully answer the questions, we'll have to clean this dataset:
<ul>
    <li>Drop all the columns with a lot of NaN values</li>
    <li>Drop unwanted columns</li>
    <li>Drop all NaN values</li>
</ul>
We're also going to get rid of every movie released before the year 2000. Why? Personal decision. I'm not a big fan of old movies, so let's stay in the 21st century (no offense, really).

In [None]:
df_clean = df[['title','year', 'genre', 'duration', 'country', 'director', 'writer', 'production_company', 'actors', 'avg_vote']] #keep the wanted columns

In [None]:
df_clean.head(2)

Let's recheck the new data frame for potentially dirty data.

In [None]:
df_clean.isna().sum() #find all the NaN values

In [None]:
df_clean = df_clean.dropna() #drop all the rows with NaN values
df_clean.isna().sum()

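Remove the row where the value of the 'year' column is 'TV Movie 2019'. This is the value that prevents the column from being converted to integers.

In [None]:
df_clean = df_clean[df_clean['year'] != 'TV Movie 2019'].copy() #.copy() makes df_clean an independent frame before 'year' is reassigned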

In [None]:
df_clean['year'] = df_clean['year'].astype('int')
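
As an alternative to removing the odd value by hand, the four-digit year could be pulled out of the text with a regular expression. This is only a sketch on toy values and not what this notebook does. It assumes every entry contains exactly one run of four digits.

```python
import pandas as pd

# Toy values (hypothetical) mimicking a 'year' column stored as text
years = pd.Series(['2001', 'TV Movie 2019', '1999'])

# Pull out the first run of four digits, then convert to integers
parsed = years.str.extract(r'(\d{4})', expand = False).astype('int')

print(parsed.tolist())
```

The trade-off is that this keeps the 'TV Movie 2019' row as a 2019 movie instead of dropping it.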

In [None]:
#drop years below 2000
indexNames = df_clean[df_clean['year'] < 2000].index

In [None]:
df_clean.drop(indexNames, inplace = True)
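
The same filter can also be written as a boolean mask that keeps the wanted rows, rather than collecting the index of the unwanted ones. Here is a small sketch on hypothetical rows:

```python
import pandas as pd

# Toy data (hypothetical rows)
toy = pd.DataFrame({'title': ['A', 'B', 'C'], 'year': [1995, 2000, 2010]})

# Keep the rows from 2000 onwards instead of dropping the older ones by index
recent = toy[toy['year'] >= 2000]

print(recent['title'].tolist())
```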

Split the countries so that each one can be counted individually.
<br>
This piece of code will be used later on (in the section "top countries creating movies"), but since it's part of preparing the data, I put it here.

In [None]:
countries = {} #maps each country name to the number of movies listing it
list_countries = list(df_clean['country']) #one entry per movie, e.g. 'USA, France'
for entry in list_countries:
    for name in entry.split(','): #an entry with a single country simply yields one name
        if name in countries:
            countries[name] += 1 #country already seen: increment its count
        else:
            countries[name] = 1 #first occurrence of this country
#names that follow a comma keep a leading space; it is removed in the section "top countries creating movies"
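
For comparison, the same kind of count can be sketched with pandas only, using `str.split`, `explode` and `value_counts`. The rows below are hypothetical. This is an alternative approach, not the one used in the rest of this notebook.

```python
import pandas as pd

# Toy data (hypothetical) in the same comma-separated format as the 'country' column
toy = pd.DataFrame({'country': ['USA', 'USA, France', 'France, Germany, USA']})

country_counts = (
    toy['country']
    .str.split(',')      # 'USA, France' -> ['USA', ' France']
    .explode()           # one row per country
    .str.strip()         # drop the space left behind by the split
    .value_counts()      # count the rows per country, largest first
)

print(country_counts.to_dict())
```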

Split the genres so that each one can be counted individually.
<br>
This piece of code will be used later on (in the section "distribution of genre"), but since it's part of preparing the data, I put it here.

In [None]:
genre = list(df_clean['genre'])
genre_list = [] #create an empty list

for i in genre:
    i = list(i.split(',')) #split words when comma
    for j in i:
        genre_list.append(j.replace(' ', '')) #replace extra space
        
g = Counter(genre_list)
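
To see what the `Counter` holds after this step, here is a small sketch with hypothetical genre strings. It uses `strip()` rather than `replace(' ', '')`, which only removes the spaces around a name.

```python
from collections import Counter

# Toy genre strings (hypothetical), same format as the 'genre' column
genres = ['Drama', 'Comedy, Drama', 'Sci-Fi, Drama, Thriller']

# Split on commas and strip the surrounding spaces in one pass
flat = [name.strip() for entry in genres for name in entry.split(',')]
toy_counts = Counter(flat)

# most_common(n) returns the n most frequent items with their counts
print(toy_counts.most_common(1))
```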

In [None]:
df_clean.shape

In [None]:
df_clean.describe()

# Analysis

### Distribution per year

2017 was the year with the most movies released, while 2020 had the fewest. This can probably be explained by a bad year for the cinema due to COVID-19.
<br>
2019 and 2020 are the worst years so far.
<br>
Since 2000, the number of movies released has mostly been growing. The years 2001, 2009 and 2017 saw a small decrease in the number of movies released. 2018 and 2019 have been the worst years so far for the cinema.
<br> The year 2020 has been a disaster, but this is explained by the ongoing COVID-19 pandemic.

In [None]:
#groupby year and count how many title we have each year
title = df_clean.groupby('year').agg({'title': ['count']})
title.columns = ['Title Count']
title = title.sort_values('Title Count', ascending = False)
title.head(5)

In [None]:
#visualize with a bar graph in descending title count order 
fig_dim = (10,5)
fig, ax = plt.subplots(figsize = fig_dim)
sns.countplot(x = 'year', data = df_clean, order = df_clean['year'].value_counts().index)
plt.title('Title count per year, sorted by number of titles released (descending)')

In [None]:
#visualize with a bar graph in ascending year order
fig_dim = (10,5)
fig, ax = plt.subplots(figsize = fig_dim)
ax = sns.countplot(x = 'year', data = df_clean)
plt.title('Title count per year')

#display the count on the bar graph
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x(), p.get_height() + 20))

Get the growth year to year (in %)

In [None]:
title = title.sort_index() #sort by year. The year is the index
growth = title.pct_change() * 100 #year-over-year growth; pct_change() returns a fraction, so convert it to %
growth.columns = ['% growth']
growth.head(3)
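
As a quick check of the arithmetic, `pct_change()` computes (current - previous) / previous and returns a fraction. Multiplying by 100 gives a percentage. Here is a sketch with hypothetical yearly counts:

```python
import pandas as pd

# Toy yearly counts (hypothetical numbers)
toy = pd.Series([100, 120, 90], index = [2000, 2001, 2002])

# pct_change() returns (current - previous) / previous as a fraction
toy_growth = toy.pct_change()

# The first year has no previous year to compare with, so it is NaN
print((toy_growth * 100).round(1).tolist())
```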

In [None]:
plt.figure(figsize = (15, 8))
x = growth['% growth']
y = growth.index
plt.bar(y, x)
plt.plot(y, x, color = 'red', linewidth = 2.0)
plt.title('Movie growth per year')
plt.xlabel('Year')
plt.ylabel('% growth')

### Get the top rated movies

The USA, India and the UK are the most represented countries in the top 10 rated movies from 2000 to 2020.
<br>
2002 seems to be the year with the best average rating, while 2020 has the worst. There are 2 possible explanations:
<ul>
    <li>2020 was a bad year for the cinema, so it has fewer movies than other years</li>
    <li>Leaving 2020 aside, 2002 is the year with the 2nd fewest movies released</li>
</ul>
Most of the ratings are distributed around 6 out of 10, and the overall average is 5.70.
<br>
<br>

Top 10 rated movies with their year and country

In [None]:
df_clean = df_clean.sort_values(by = 'avg_vote', ascending = False)

In [None]:
rating = df_clean[0:10] #get the 1st 10 values
fig = px.sunburst(rating, path = ['year', 'country', 'title'], values = 'avg_vote', color = 'avg_vote')
fig.show()

Which year was the best year?

In [None]:
best = df_clean.groupby('year').agg({'avg_vote': ['mean']})
best.columns = ['Average rating']
best = best.sort_values('Average rating', ascending = False)
best.head()
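
Since the number of movies differs from year to year, it can help to read the average rating next to the count it is based on. Here is a minimal sketch with hypothetical ratings. The column names `mean_rating` and `n_movies` are made up for the illustration.

```python
import pandas as pd

# Toy ratings (hypothetical values)
toy = pd.DataFrame({
    'year': [2002, 2002, 2003, 2003, 2003, 2003],
    'avg_vote': [8.0, 7.0, 6.0, 6.5, 5.5, 6.0],
})

# Named aggregation: one column for the mean, one for the number of movies
summary = toy.groupby('year').agg(
    mean_rating = ('avg_vote', 'mean'),
    n_movies = ('avg_vote', 'count'),
)

print(summary)
```

In this toy case, the year with the higher average is also the one with fewer movies behind it.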

In [None]:
plt.figure(figsize = (15, 8))
x = best['Average rating'].round(decimals = 3)
y = best.index
plt.bar(y, x)
plt.title('Average rating per year')
plt.xlabel('Year')
plt.ylabel('Average rating (out of 10)')

Visualize the distribution of the ratings

In [None]:
average = df_clean['avg_vote'].mean() #define the average rating for all years

In [None]:
fig_dim = (10,7)
fig, ax = plt.subplots(figsize = fig_dim)
sns.histplot(df_clean['avg_vote'], ax=ax, kde = True) #plot distribution
plt.axvline(average, color = 'red', label = 'Average rating') #plot the average
plt.legend()
plt.title('Distribution of the ratings')
plt.show()
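
One possible reason why the overall average sits a bit below the bulk of the ratings is that a few low ratings pull the mean down. This is an assumption about the data and is not verified here. The sketch below uses hypothetical values to show the effect by comparing the mean with the median.

```python
import pandas as pd

# Toy ratings (hypothetical values): most sit near 6, one is very low
toy_votes = pd.Series([2.0, 5.5, 6.0, 6.0, 6.5, 7.0])

toy_mean = toy_votes.mean()      # pulled down by the low rating
toy_median = toy_votes.median()  # middle of the sorted values

print(toy_mean, toy_median)
```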

### Top countries creating movies

The USA, France and Germany are the 3 countries that have released the most movies since 2000.
<br>
These 3 countries account for 55.3% of the movies released by the top 10 countries.

In [None]:
countries_fin = {} #create an empty dictionary
for country, no in countries.items():
    country = country.strip() #remove the extra space created by the split(','), keeping spaces inside names
    if country in countries_fin:
        countries_fin[country] += no
    else:
        countries_fin[country] = no

#sort the country according to their count
countries_fin = {k: v for k, v in sorted(countries_fin.items(), key = lambda item: item[1], reverse= True)}

In [None]:
plt.figure(figsize = (8,8))
ax = sns.barplot(x = list(countries_fin.keys())[0:10], y = list(countries_fin.values())[0:10])
plt.title('Top 10 countries creating movies')

#display the count on the bar graph
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x() + 0.1, p.get_height() + 20))
#the two '+' help to position the value on the bar graph

In [None]:
#create the dataframe
df_c = pd.DataFrame(list(countries_fin.items()), columns = ['Country', 'Country count'])

#plot a pie chart of the share of each of the top 10 countries
fig,ax = plt.subplots()
x = df_c['Country count'][:10] #get the top 10 countries
labels = df_c['Country'][:10] #get the top 10 countries
ax.pie(x, labels = labels, radius = 2)

#create a white circle at the center of the pie to create a donut chart
my_circle = plt.Circle( (0,0), 0.7, color = 'white')
p = plt.gcf()
p.gca().add_artist(my_circle)

plt.show()
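
The share of the top 3 countries within the top 10 can be computed directly from a sorted dictionary like `countries_fin`. The sketch below uses hypothetical counts and a top 5 instead of a top 10 to keep it short.

```python
# Hypothetical counts, sorted in descending order like countries_fin
toy_counts = {'A': 50, 'B': 30, 'C': 20, 'D': 10, 'E': 10}

values = list(toy_counts.values())
top_n = 5   # size of the group used as the denominator (10 in this notebook)
top_k = 3   # number of leading countries in the numerator

# Share of the top_k countries among the top_n countries, in %
share = sum(values[:top_k]) / sum(values[:top_n]) * 100
print(round(share, 1))
```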

### Duration distribution

Most movies last between 75 and 125 minutes.
<br>
On average, a movie lasts between 100 and 105 minutes (from 2000 to 2020), which is fairly stable over 21 years, with 2019 and 2020 being the years with the highest average duration.
<br>
The minimum duration per year ranges from 42 to 50 minutes, while the maximum duration per year ranges from 202 to 808 minutes. Yes yes, there was a movie in 2016 which lasted 808 minutes, or 13.47 hours.

In [None]:
fig_dim = (10,5)
fig, ax = plt.subplots(figsize = fig_dim)
sns.histplot(df_clean['duration'], ax=ax, kde = True)
ax.set_xlim(0,300) #limit the duration to 300 to focus where there are the most concentration
plt.title('Distribution of the duration of movies')
plt.show()

Let's see in the form of a table more detailed information about the duration

In [None]:
year = df_clean.groupby('year').agg({'duration': ['max', 'min', 'mean']}) #string names avoid the warning recent pandas versions raise for np.max, np.min, np.mean
year.columns = ['MAX duration', 'MIN duration', 'MEAN duration']
year = year.sort_values('MAX duration', ascending = False)
year.head()

In 2016, one movie lasted 808 minutes, or 13.47 hours. Let's find out about this movie.

In [None]:
df_clean.loc[df_clean['duration'] == 808]
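
For the record, the conversion from minutes to hours can be checked with `divmod`:

```python
minutes = 808

hours, remainder = divmod(minutes, 60)   # whole hours and leftover minutes
decimal_hours = minutes / 60             # the same duration as a decimal number of hours

print(f'{hours} h {remainder} min, or about {decimal_hours:.2f} hours')
```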

In [None]:
plt.figure(figsize = (15, 8))
x = year['MEAN duration'].round(decimals = 3)
y = year.index
plt.bar(y, x)
plt.title('Average duration of movies per year')
plt.xlabel('Year')
plt.ylabel('Average duration (minute)')

### Distribution of Genre

Drama is by far the most represented genre from 2000 to 2020, while Sci-Fi movies, my favorite genre, are not well represented.

In [None]:
g = {k: v for k, v in sorted(g.items(), key=lambda item: item[1], reverse= True)}

In [None]:
fig_dim = (10,7)
fig, ax = plt.subplots(figsize = fig_dim)
x = list(g.keys())
y = list(g.values())
ax.vlines(x, ymin = 0, ymax = y, linewidth = 4)
plt.xticks(rotation = 90)
plt.show()

Create a wordcloud visualization of the genres.
<br>
Help: https://www.python-graph-gallery.com/wordcloud/
<br>
Since Drama and Comedy are the most represented genres, the wordcloud shows them as the biggest words.

In [None]:
#create the wordcloud object, sizing each genre by its count in g
wordcloud = WordCloud(width = 500, height = 500, max_words = 100000, background_color = 'white').generate_from_frequencies(g)

#Display the generated image
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.margins(x = 0, y = 0)
plt.show()