**This notebook is an exercise in the [Data Visualization](https://www.kaggle.com/learn/data-visualization) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/final-project).**

---


Now it's time for you to demonstrate your new skills with a project of your own!

In this exercise, you will work with a dataset of your choosing.  Once you've selected a dataset, you'll design and create your own plot to tell interesting stories behind the data!

## Setup

Run the next cell to import and configure the Python libraries that you need to complete the exercise.

In [None]:
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

In [None]:
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.data_viz_to_coder.ex7 import *
print("Setup Complete")

## Step 1: Attach a dataset to the notebook

Begin by selecting a CSV dataset from [Kaggle Datasets](https://www.kaggle.com/datasets).  If you're unsure how to do this or would like to work with your own data, please revisit the instructions in the previous tutorial.

Once you have selected a dataset, click on the **[+ Add Data]** option in the top right corner.  This will generate a pop-up window that you can use to search for your chosen dataset.  

![ex6_search_dataset](https://i.imgur.com/cIIWPUS.png)

Once you have found the dataset, click on the **[Add]** button to attach it to the notebook.  You can check that it was successful by looking at the **Data** dropdown menu to the right of the notebook -- look for an **input** folder containing a subfolder that matches the name of the dataset.

<center>
<img src="https://i.imgur.com/nMYc1Nu.png" width=30%><br/>
</center>

You can click on the caret to the left of the name of the dataset to double-check that it contains a CSV file.  For instance, the image below shows that the example dataset contains two CSV files: (1) **dc-wikia-data.csv**, and (2) **marvel-wikia-data.csv**.

<center>
<img src="https://i.imgur.com/B4sJkVA.png" width=30%><br/>
</center>

Once you've uploaded a dataset with a CSV file, run the code cell below **without changes** to receive credit for your work!

In [None]:
# Check for a dataset with a CSV file
step_1.check()

## Step 2: Specify the filepath

Now that the dataset is attached to the notebook, you can find its filepath.  To do this, begin by clicking on the CSV file you'd like to use.  This will open the CSV file in a tab below the notebook.  You can find the filepath towards the top of this new tab.  

![ex6_filepath](https://i.imgur.com/fgXQV47.png)

After you find the filepath corresponding to your dataset, fill it in as the value for `my_filepath` in the code cell below, and run the code cell to check that you've provided a valid filepath.  For instance, in the case of this example dataset, we would set
```
my_filepath = "../input/fivethirtyeight-comic-characters-dataset/dc-wikia-data.csv"
```  
Note that **you must enclose the filepath in quotation marks**; otherwise, the code will return an error.

Once you've entered the filepath, you can close the tab below the notebook by clicking on the **[X]** at the top of the tab.

In [None]:
# Fill in the line below: Specify the path of the CSV file to read
my_filepath = '../input/netflix-shows/netflix_titles.csv'

# Check for a valid filepath to a CSV file in a dataset
step_2.check()

## Step 3: Load the data

Use the next code cell to load your data file into `my_data`.  Use the filepath that you specified in the previous step.

In [None]:
# Fill in the line below: Read the file into a variable my_data
my_data = pd.read_csv(my_filepath)

# Check that a dataset has been uploaded into my_data
step_3.check()

**_After the code cell above is marked correct_**, run the code cell below without changes to view the first five rows of the data.

In [None]:
# Print the first five rows of the data
my_data.head()

## Step 4: Visualize the data

Use the next code cell to create a figure that tells a story behind your dataset.  You can use any chart type (_line chart, bar chart, heatmap, etc._) of your choosing!

In [None]:
# Create a plot
# 1. Number of movies and TV shows per release year
movie_data = my_data[my_data['type']=='Movie'].groupby(by='release_year').count()
tv_show_data = my_data[my_data['type']=='TV Show'].groupby(by='release_year').count()

plt.figure(figsize=(14,6))
# After count(), each column holds the number of non-null values per year;
# 'type' is never null in these filtered frames, so it equals the number of titles
sns.lineplot(data=movie_data['type'],label='Movie')
sns.lineplot(data=tv_show_data['type'],label='TV Show')

plt.xlabel('release_year')
plt.ylabel('Number of titles')
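
The per-year counting used for this plot can also be done in a single pass. The sketch below is one possible approach, not the course's prescribed method; the `toy` frame is a hypothetical stand-in for the Netflix data, with only the two columns the plot needs.

```python
import pandas as pd

# Hypothetical stand-in rows; the real data comes from netflix_titles.csv
toy = pd.DataFrame({
    'type': ['Movie', 'Movie', 'TV Show', 'Movie', 'TV Show'],
    'release_year': [2019, 2019, 2019, 2020, 2020],
})

# size() counts rows per (year, type) group;
# unstack() moves 'type' into the columns, filling absent groups with 0
counts_by_year = (
    toy.groupby(['release_year', 'type'])
       .size()
       .unstack(fill_value=0)
)
print(counts_by_year)
```

A wide frame like `counts_by_year` can be passed to `sns.lineplot(data=counts_by_year)`, which draws one line per column.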

In [None]:
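# 2. Number of movies and TV shows per country
plt.figure(figsize=(30,10))
movies_country=my_data[my_data['type']=='Movie'].groupby('country').count()
# The label must match the data exactly: 'TV Show' (with a space), as in the first plot
tv_shows_country=my_data[my_data['type']=='TV Show'].groupby('country').count()

sns.lineplot(data=movies_country,x=movies_country.index,y='type',label='Movie')
sns.lineplot(data=tv_shows_country,x=tv_shows_country.index,y='type',label='TV Show')
plt.ylabel('Number of titles')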
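
Filters like the ones in this cell only work when the label matches a value in the `type` column exactly. A comparison against a label that never occurs raises no error; it simply returns an empty frame, and the resulting plot is blank. The sketch below uses a small hypothetical frame to show one way to check the labels before filtering.

```python
import pandas as pd

# Hypothetical stand-in rows for illustration only
toy = pd.DataFrame({
    'type': ['Movie', 'TV Show', 'Movie'],
    'country': ['India', 'Japan', 'India'],
})

# List the distinct labels that actually occur in the column
labels = toy['type'].unique().tolist()
print(labels)

# A label with an underscore matches nothing; the label with a space matches one row
wrong = toy[toy['type'] == 'TV_Show']
right = toy[toy['type'] == 'TV Show']
print(len(wrong), len(right))
```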

In [None]:
# 3. Which rating category has the most movies?
plt.figure(figsize=(18,8))
rating_movies=my_data[my_data['type']=='Movie'].groupby('rating').count()
# Computed for comparison but not plotted below; the label must be 'TV Show' (with a space)
rating_tv_shows=my_data[my_data['type']=='TV Show'].groupby('rating').count()

sns.barplot(x=rating_movies.index,y='type',data=rating_movies)
plt.ylabel('Number of movies')
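
The bar chart answers the question visually. To read the answer off directly, one option is `value_counts()` followed by `idxmax()`. This is a minimal sketch on hypothetical rows, not output from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in rows for illustration only
toy = pd.DataFrame({
    'type': ['Movie', 'Movie', 'Movie', 'TV Show'],
    'rating': ['TV-MA', 'PG-13', 'TV-MA', 'TV-MA'],
})

# Count how many movies fall under each rating
rating_counts = toy.loc[toy['type'] == 'Movie', 'rating'].value_counts()

# idxmax() returns the index label (here, the rating) with the largest count
top_rating = rating_counts.idxmax()
print(top_rating, rating_counts.max())
```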

## Keep going

In the **[final tutorial](https://www.kaggle.com/alexisbcook/creating-your-own-notebooks)**, learn how to keep using your skills to create data visualizations after completing the micro-course.

In [None]:
# 4. Number of movies in each genre
# Dataframe for movies only
movie_data_df = my_data[my_data['type']=='Movie']

# Split each 'listed_in' entry into individual genres, one genre per row.
# '&' is treated as a separator and spaces are removed, so
# 'Sci-Fi & Fantasy' becomes 'Sci-Fi' and 'Fantasy'.
genres = (
    movie_data_df['listed_in']
    .str.replace('&', ',', regex=False)
    .str.replace(' ', '', regex=False)
    .str.split(',')
    .explode()
    .reset_index(drop=True)
)

# Drop the generic 'Movies' label, then count how often each genre occurs
genres = genres[genres != 'Movies']
movie_genre_list_df = genres.value_counts().rename_axis('genres').reset_index(name='count')

# Barplot for the number of movies under each genre
plt.figure(figsize=(40,14))
sns.barplot(x='genres', y='count', data=movie_genre_list_df)

In [None]:
# 5. Movie with the longest duration
# Strip the ' min' suffix from each duration string (e.g. '90 min' -> '90')
elem_lst = [duration.replace(' min', '') for duration in movie_data_df['duration']]
# Convert the remaining digit strings to integers
int_lst = [int(elem) for elem in elem_lst]


In [None]:
# movie_data_df['duration_in_mins'] = pd.Series(int_lst)
# # movie_data_df
# nw_df = pd.DataFrame(movie_data_df[['title','duration_in_mins']])
# print(nw_df)
# plt.figure(figsize=(40,14))
# sns.lineplot(data=nw_df, x='title',y='duration_in_mins')
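
One likely reason the commented-out attempt above misbehaves: pandas aligns on index labels when a Series is assigned as a column. `pd.Series(int_lst)` gets a fresh 0, 1, 2, ... index, while the filtered `movie_data_df` keeps the row labels of the original frame, so many values land on the wrong rows or become `NaN`. The sketch below shows one way around this by deriving the minutes from the column itself, so the index carries over. The `toy` frame is hypothetical, and `errors='coerce'` is an assumption about how missing or malformed durations should be treated.

```python
import pandas as pd

# Hypothetical stand-in rows with a non-contiguous index, like a filtered frame
toy = pd.DataFrame(
    {'title': ['A', 'B', 'C'], 'duration': ['90 min', '125 min', None]},
    index=[3, 7, 11],
)

# Strip the ' min' suffix and convert to numbers; unparseable or missing values become NaN
minutes = pd.to_numeric(
    toy['duration'].str.replace(' min', '', regex=False),
    errors='coerce',
)

# 'minutes' shares toy's index, so the assignment lines up row by row
toy = toy.assign(duration_in_mins=minutes)

# idxmax() skips NaN and returns the row label of the longest duration
longest = toy.loc[toy['duration_in_mins'].idxmax()]
print(longest['title'], longest['duration_in_mins'])
```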

---




*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum/161291) to chat with other Learners.*