# Data Visualization with Python
---






## What Is Data Visualization, and Why Does It Matter?
![Data visualization examples](https://blog.hubspot.com/hs-fs/hub/53/file-2576089202-png/00-Blog_Thinkstock_Images/data-visualization-examples.png)


One of the key skills of a data scientist is the ability to tell a compelling story. People generally take in visual information more quickly than the same information presented as text.

Visualizing data helps you extract information, understand the data better, and make effective decisions.

The main goal of this tutorial is to help you get started with making visualizations in Python.

We will survey the landscape of Python visualization tools, and then use matplotlib, seaborn, and folium to show how to take data that at first glance has little meaning and expose its underlying trends and correlations in a form that makes sense to people.

## Python Data Visualization Landscape


![Python Landscape](https://rougier.github.io/python-visualization-landscape/landscape-colors.png)


# Basic Visualizations



## Line Plot

A line chart, or line plot, displays information as a series of data points, called markers, connected by straight line segments. It is a basic chart type common in many fields.
Use a line plot when you have a continuous dataset. It is best suited to showing trends in data over a period of time.
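
As a minimal sketch of the idea, the snippet below draws a line plot with visible markers. The monthly figures are made up for illustration and are not taken from the movies dataset used later in this tutorial.

```python
import matplotlib.pyplot as plt

# Hypothetical data: monthly ticket sales in thousands, invented for this example.
months = list(range(1, 13))
ticket_sales = [12, 15, 14, 18, 21, 25, 30, 28, 22, 19, 16, 20]

fig, ax = plt.subplots(figsize=(8, 4))
# marker='o' draws each data point; the line connects consecutive points.
line, = ax.plot(months, ticket_sales, marker='o')
ax.set_title('Monthly ticket sales (illustrative data)')
ax.set_xlabel('Month')
ax.set_ylabel('Tickets sold (thousands)')
plt.show()
```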

## Area Plot
An area plot, also known as an area chart or area graph, depicts accumulated totals, as numbers or percentages, over time. It is based on the line plot and is commonly used to compare two or more quantities.
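
A small sketch of a stacked area plot, assuming pandas and matplotlib are installed. The genre counts are hypothetical and serve only to show how the stacked areas accumulate into a total.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: films released per year in two genres, invented for this example.
films = pd.DataFrame(
    {'Comedy': [30, 35, 42, 40, 48], 'Drama': [50, 52, 49, 60, 65]},
    index=[2001, 2002, 2003, 2004, 2005],
)

# kind='area' stacks the columns by default, so the top edge traces the combined total.
ax = films.plot(kind='area', figsize=(8, 4), alpha=0.6)
ax.set_title('Films released per year by genre (illustrative data)')
ax.set_xlabel('Year')
ax.set_ylabel('Number of films')
plt.show()
```

Passing `stacked=False` instead would overlay the areas, which can be easier to read when you care about each quantity on its own rather than the running total.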

## Histogram
A histogram represents the frequency distribution of a numeric dataset. It partitions the range of the data into bins, assigns each data point to a bin, and counts the data points in each bin. The vertical axis therefore shows the frequency, that is, the number of data points in each bin.
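
The binning and counting steps can be sketched with NumPy before any drawing happens. The runtimes and bin edges below are hypothetical values chosen for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical runtimes in minutes, invented for this example.
runtimes = [82, 88, 91, 95, 97, 101, 104, 108, 112, 118, 125, 139]
bin_edges = [80, 95, 110, 125, 140]

# Each bin includes its left edge and excludes its right edge,
# except the last bin, which includes both.
counts, edges = np.histogram(runtimes, bins=bin_edges)
print(counts)  # → [3 5 2 2]

# plt.hist performs the same binning and draws one bar per bin.
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(runtimes, bins=bin_edges, edgecolor='black')
ax.set_xlabel('Runtime (minutes)')
ax.set_ylabel('Number of movies')
plt.show()
```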

## Bar Chart
A bar chart, also known as a bar graph, is a very popular visualization tool. Unlike a histogram, which bins a single numeric variable, a bar chart draws one bar per item, with the length of each bar proportional to the value it represents. It is commonly used to compare the values of a variable across categories at a given point in time.
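
A minimal sketch of a bar chart over categories. The genre names and ratings are hypothetical values used only to show the one-bar-per-item structure.

```python
import matplotlib.pyplot as plt

# Hypothetical data: average rating per genre, invented for this example.
genres = ['Action', 'Comedy', 'Drama', 'Horror']
avg_rating = [6.1, 5.9, 6.8, 5.4]

fig, ax = plt.subplots(figsize=(6, 4))
# One bar per category; each bar's height equals that category's value.
bars = ax.bar(genres, avg_rating)
ax.set_title('Average rating by genre (illustrative data)')
ax.set_xlabel('Genre')
ax.set_ylabel('Average rating')
plt.show()
```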

## Waffle Chart
A waffle chart is a great way to visualize data in relation to a whole or to highlight progress against a given threshold.
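
Matplotlib has no built-in waffle chart, so the sketch below builds one by hand from a grid of tiles. The category shares are hypothetical, and the tile-filling approach is one possible way to do it rather than a standard API.

```python
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical shares of films by language, invented for this example.
categories = {'English': 60, 'French': 25, 'Other': 15}
n_rows, n_cols = 10, 10  # 100 tiles, so each tile stands for 1% of the whole

total = sum(categories.values())
# Tiles per category, proportional to its share. These values add up to exactly
# 100 tiles; other data may need an adjustment after rounding.
tiles_per_category = [round(v / total * n_rows * n_cols) for v in categories.values()]

# Build the grid column by column: each tile stores the index of its category.
flat = np.repeat(np.arange(len(categories)), tiles_per_category)
waffle = flat.reshape(n_cols, n_rows).T

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.matshow(waffle, cmap='viridis')
ax.set_xticks([])
ax.set_yticks([])

# Legend entries reuse the colors the colormap assigned to each category index.
handles = [mpatches.Patch(color=im.cmap(im.norm(i)), label=name)
           for i, name in enumerate(categories)]
ax.legend(handles=handles, loc='upper left', bbox_to_anchor=(1.02, 1))
plt.show()
```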

## Word Cloud
A word cloud depicts the importance of different words in a body of text. It works in a simple way: the more often a specific word appears in the source text, the bigger and bolder it appears in the word cloud.
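
The frequency-to-size idea can be sketched with the standard library alone. The sample text, stopword list, and font size range below are assumptions made for illustration; a dedicated library, such as the third-party `wordcloud` package, would normally also handle the layout and drawing.

```python
import re
from collections import Counter

# Hypothetical sample text and stopword list, invented for this example.
text = ("A love story is a story of love. The hero finds love in the city, "
        "and the city changes the hero.")
stopwords = {'a', 'is', 'of', 'the', 'in', 'and'}

# Count how often each non-stopword appears.
words = re.findall(r"[a-z']+", text.lower())
freq = Counter(w for w in words if w not in stopwords)

# Scale font sizes linearly with frequency: the most frequent word gets max_size.
min_size, max_size = 10, 48
max_count = max(freq.values())
font_sizes = {w: min_size + (max_size - min_size) * c / max_count
              for w, c in freq.items()}

for word, count in freq.most_common():
    print(f"{word}: count={count}, font size={font_sizes[word]:.1f}")
```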

# Data Preparation
To visualize and plot data, we first have to prepare our data source.
We use pandas, a popular data analysis library, to load, explore, and process our data.



In [None]:
import datetime as dt
import matplotlib as mpl
import matplotlib.pyplot as plt 
import pandas as pd 
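pd.options.display.float_format = '{:.2f}'.format  # show floats with two decimal places
pd.options.display.min_rows = 50  # rows to show when a long DataFrame is truncated

# display the result of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"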

In [None]:
df = pd.read_csv('movies_metadata.csv') #load data

In [None]:
df.head(5)

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isnull().sum()/df.shape[0]*100 # check for percentage of missing data by column


# Pointed Questions
### 1. What is the average budget by original language?

In [None]:
# convert budget and popularity to numeric types; values that cannot be parsed become NaN
df['budget'] = pd.to_numeric(df['budget'], errors='coerce')
df['popularity'] = pd.to_numeric(df['popularity'], errors='coerce')


In [None]:
df['original_language'] = df['original_language'].astype('str')

In [None]:
df.original_language.unique()

In [None]:
# drop rows whose language column holds a stray numeric value instead of a language code
df = df[~df['original_language'].isin(['68.0', '82.0', '104.0'])]

In [None]:
df.original_language.unique()

In [None]:
df_test = df[['original_language','budget']]
df_test.head(10)
df_grp = df_test.groupby(['original_language'], as_index = False).mean()
df_grp.head(10)

In [None]:
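# put the language codes on the x-axis instead of the integer row index
df_grp.plot(kind='bar', x='original_language', y='budget', legend=False, figsize=(15, 8))
plt.title('Average budget by original language')
plt.ylabel('average budget')
plt.xlabel('original language')
plt.show()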

### 2. What is the frequency distribution of the runtime?

In [None]:
df['runtime'].plot(kind = 'hist')
plt.title('Histogram of runtime of movies')
plt.ylabel('number of movies')
plt.xlabel('runtime (in minutes)')
plt.show();

### 3. The era of the 90s RomCom. Was that really a thing?

In [None]:
df.columns

In [None]:
df[['original_title', 'genres', 'popularity', 'vote_average', 'release_date']].head()

In [None]:
import ast

def is_rom_com(genres):
    """Return 1 if the genre names include both Romance and Comedy, else 0."""
    return int('Romance' in genres and 'Comedy' in genres)

# data processing
# - drop non-meaningful and missing values in release_date
# - identify genres assigned to a movie
# - convert release_date str to datetime and extract the year
df = df[~df['release_date'].isin(['1', '12', '22'])]
df = df.dropna(subset=['release_date'])
# genres holds a string representation of a list of dicts;
# ast.literal_eval parses it without executing arbitrary code, unlike eval
df['is_rom_com'] = df['genres'].apply(
    lambda x: is_rom_com([genre['name'] for genre in ast.literal_eval(x)]))
df['release_year'] = pd.to_datetime(df['release_date'], format='%Y-%m-%d').dt.year
rom_coms_by_year = df.groupby('release_year')['is_rom_com'].sum()

In [None]:
rom_coms_by_year.tail()

In [None]:
figsize = (15,8)
plt.figure(1, figsize=figsize)
plt.plot(rom_coms_by_year.index, rom_coms_by_year.values)
plt.title('Number of romantic comedy films over time')
plt.ylabel('Number of films')
plt.xlabel('Release year');


### 4. Is there a relationship between a movie's rating and the amount of revenue it generates?

In [None]:
plt.figure(2, figsize=figsize)
plt.plot(df['vote_average'], df['revenue'], '.')
plt.title('Revenue by ratings')
plt.ylabel('Revenue')
plt.xlabel('Average vote');

In [None]:
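# aggregate first: one bar per movie would stack thousands of bars on the same rating
avg_revenue = df.groupby('vote_average')['revenue'].mean()
plt.figure(3, figsize=figsize)
plt.bar(avg_revenue.index, avg_revenue.values, width=0.1)
plt.title('Average revenue by rating')
plt.ylabel('Average revenue')
plt.xlabel('Average vote');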

# Potential questions
Are foreign language films rated higher?

# References + Resources
- [Python Graph Gallery](https://python-graph-gallery.com/). A website that features a collection of graphs and reproducible Python code to generate those graphs.
- [Pyviz](http://pyviz.org/). An open platform dedicated to helping users decide on the best open-source (OSS) Python data visualization tools.
- [Matplotlib](http://aosabook.org/en/matplotlib.html). An article about the history and architecture of Matplotlib, written by John Hunter, Matplotlib's creator, and Michael Droettboom.
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake VanderPlas.
- The dataset was assembled by Kaggle user rounakbanik. Learn more about the project [here](https://www.kaggle.com/rounakbanik/the-movies-dataset).
- [Python Grids](http://www.pythongrids.org/grids/g/data-visualization/). A website that compares stats on 14 Python plotting libraries.
