<a href="https://colab.research.google.com/github/Data-Science-and-Data-Analytics-Courses/UCSanDiegoX---Python-for-Data-Science-03-Jan-2019-audit-/blob/master/Week%2004%20Pandas/Pandas/Introduction%20to%20Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<p style="font-family: Arial; font-size:3.75em;color:purple; font-style:bold"><br>
Pandas</p><br>

*pandas* is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python. 

*pandas* builds upon *numpy* and *scipy*, providing easy-to-use data structures and data manipulation functions with integrated indexing.

The main data structures *pandas* provides are *Series* and *DataFrames*. After a brief introduction to these two data structures and data ingestion, the key features of *pandas* this notebook covers are:
* Generating descriptive statistics on data
* Data cleaning using built-in pandas functions
* Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data
* Merging multiple datasets using dataframes
* Working with timestamps and time-series data

**Additional Recommended Resources:**
* *pandas* Documentation: http://pandas.pydata.org/pandas-docs/stable/
* *Python for Data Analysis* by Wes McKinney
* *Python Data Science Handbook* by Jake VanderPlas

Let's get started with our first *pandas* notebook!

# Set up Google Colaboratory (run the following code cell one time only)

In [0]:
import os

PULL_URL = "http://github.com/Data-Science-and-Data-Analytics-Courses/UCSanDiegoX---Python-for-Data-Science-03-Jan-2019-audit-"

# Clone GitHub repository to local repository
!git clone $PULL_URL
os.chdir(os.path.basename(PULL_URL)) # change working directory to local repository directory
LOCAL_REPO = os.getcwd()

# Change working directory to notebook's directory
%cd Week 04 Pandas/Pandas

# Pandas

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

Import Libraries
</p>

In [0]:
import pandas as pd

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Introduction to pandas Data Structures</p>
<br>
*pandas* has two main data structures: *Series* and *DataFrames*.

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
pandas Series</p>

A *pandas Series* is a one-dimensional labeled array.
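
As a small sketch of what "labeled" means here, the example below builds a toy Series and reads the same element once by its label (`.loc`) and once by its integer position (`.iloc`). The name `prices` and its values are illustrative and not part of this notebook's data.

```python
import pandas as pd

# Toy Series: values plus an explicit index of labels (illustrative data)
prices = pd.Series([1.5, 0.25, 3.0], index=['apple', 'ball', 'clock'])

by_label = prices.loc['ball']     # look up by index label
by_position = prices.iloc[1]      # look up by integer position

# Both lookups refer to the same element
print(by_label, by_position)
print(prices.index.tolist())
```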


In [0]:
ser = pd.Series([100, 'foo', 300, 'bar', 500], ['tom', 'bob', 'nancy', 'dan', 'eric'])

In [0]:
ser

In [0]:
ser.index

In [0]:
ser.loc[['nancy','bob']]

In [0]:
# Select by integer position; .iloc makes positional access explicit
ser.iloc[[4, 3, 1]]

In [0]:
ser.iloc[2]

In [0]:
'bob' in ser

In [0]:
ser

In [0]:
ser * 2

In [0]:
ser[['nancy', 'eric']] ** 2

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
pandas DataFrame</p>

*pandas DataFrame* is a 2-dimensional labeled data structure.
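
A minimal sketch of why the labels matter: when a DataFrame is built from Series whose indexes differ, pandas aligns the rows on the union of the labels and fills the gaps with NaN. The names `left`, `right`, and `aligned` are illustrative.

```python
import pandas as pd

left = pd.Series([1.0, 2.0], index=['a', 'b'])
right = pd.Series([10.0, 20.0], index=['b', 'c'])

# Each Series becomes a column; rows are aligned on the index labels
aligned = pd.DataFrame({'left': left, 'right': right})
print(aligned)

# Labels present in only one Series produce NaN in the other column
print(aligned.isnull().sum().to_dict())
```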

<p style="font-family: Arial; font-size:1.25em;color:#2462C0; font-style:bold">
Create DataFrame from dictionary of pandas Series</p>

In [0]:
d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}

In [0]:
df = pd.DataFrame(d)
print(df)

In [0]:
df.index

In [0]:
df.columns

In [0]:
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])

In [0]:
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])

<p style="font-family: Arial; font-size:1.25em;color:#2462C0; font-style:bold">
Create DataFrame from list of Python dictionaries</p>

In [0]:
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]

In [0]:
pd.DataFrame(data)

In [0]:
pd.DataFrame(data, index=['orange', 'red'])

In [0]:
pd.DataFrame(data, columns=['joe', 'dora','alice'])

<p style="font-family: Arial; font-size:1.25em;color:#2462C0; font-style:bold">
Basic DataFrame operations</p>

In [0]:
df

In [0]:
df['one']

In [0]:
df['three'] = df['one'] * df['two']
df

In [0]:
df['flag'] = df['one'] > 250
df

In [0]:
three = df.pop('three')

In [0]:
three

In [0]:
df

In [0]:
del df['two']

In [0]:
df

In [0]:
df.insert(2, 'copy_of_one', df['one'])
df

In [0]:
df['one_upper_half'] = df['one'][:2]
df

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Case Study: Movie Data Analysis</p>
<br>This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore it using *pandas*. 

## Download the Dataset

Please note that **you will need to download the dataset**. Although the video for this notebook says that the data is in your folder, the folder turned out to be too large to fit on the edX platform due to size constraints.

Here are the links to the data source and location:
* **Data Source:** MovieLens web site (filename: ml-20m.zip)
* **Location:** https://grouplens.org/datasets/movielens/

Once the download completes, please make sure the data files are in a directory called *movielens* in your *Week-3-pandas* folder. 

Let us look at the files in this dataset using the UNIX command ls.


In [0]:
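import io
import os
import zipfile

import requests

# Download the dataset archive into memory (a large file; this may take a while)
url = "http://files.grouplens.org/datasets/movielens/ml-20m.zip"
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))
z.infolist()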

In [0]:
# Note: Adjust the name of the folder to match your local directory

# Extract data files
data_filenames = ["ratings.csv", "tags.csv", "movies.csv"]
for fpath in z.namelist():
  # Look for file names of interest
  if os.path.basename(fpath) in data_filenames:
    # Extract a single file from zip
    z.extract(fpath)
    
# Rename extracted directory
!mv ml-20m movielens
!ls
!ls movielens

In [0]:
!cat ./movielens/movies.csv | wc -l

In [0]:
!head -5 ./movielens/ratings.csv

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Use Pandas to Read the Dataset<br>
</p>
<br>
In this notebook, we will be using three CSV files:
* **ratings.csv :** *userId*,*movieId*,*rating*, *timestamp*
* **tags.csv :** *userId*,*movieId*, *tag*, *timestamp*
* **movies.csv :** *movieId*, *title*, *genres* <br>

Using the *read_csv* function in pandas, we will ingest these three files.
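
To see what `read_csv` does without needing the downloaded files, here is a small sketch that parses CSV text held in memory. The two rows are made up to mimic the layout of *movies.csv*; they are not real dataset rows, and `csv_text` and `sample` are illustrative names.

```python
import io

import pandas as pd

# Made-up rows in the same column layout as movies.csv
csv_text = """movieId,title,genres
1,Example Movie (1999),Comedy|Drama
2,"Another, Example (2001)",Action
"""

# read_csv accepts any file-like object, not only a path
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(sample.dtypes)
```

Note that the quoted title containing a comma is kept as a single field.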

In [0]:
movies = pd.read_csv('./movielens/movies.csv', sep=',')
print(type(movies))
movies.head(15)

In [0]:
# Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970

tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.head()

In [0]:
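# The timestamp column holds Unix epoch seconds, so it is read as plain integers here;
# the "Parsing Timestamps" section converts such values with pd.to_datetime(..., unit='s')
ratings = pd.read_csv('./movielens/ratings.csv', sep=',')
ratings.head()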

In [0]:
# For current analysis, we will remove timestamp (we will come back to it!)

del ratings['timestamp']
del tags['timestamp']

<h1 style="font-size:2em;color:#2467C0">Data Structures </h1>

<h1 style="font-size:1.5em;color:#2467C0">Series</h1>

In [0]:
# Extract the 0th row: notice that it is in fact a Series

row_0 = tags.iloc[0]
type(row_0)

In [0]:
print(row_0)

In [0]:
row_0.index

In [0]:
row_0['userId']

In [0]:
'rating' in row_0

In [0]:
row_0.name

In [0]:
row_0 = row_0.rename('first_row')
row_0.name

<h1 style="font-size:1.5em;color:#2467C0">DataFrames </h1>

In [0]:
tags.head()

In [0]:
tags.index

In [0]:
tags.columns

In [0]:
# Extract rows 0, 11, and 2000 from the DataFrame

tags.iloc[ [0,11,2000] ]

<h1 style="font-size:2em;color:#2467C0">Descriptive Statistics</h1>

Let's look at how the ratings are distributed!

In [0]:
ratings['rating'].describe()

In [0]:
ratings.describe()

In [0]:
ratings['rating'].mean()

In [0]:
ratings.mean()

In [0]:
ratings['rating'].min()

In [0]:
ratings['rating'].max()

In [0]:
ratings['rating'].std()

In [0]:
ratings['rating'].mode()

In [0]:
ratings.corr()

In [0]:
filter_1 = ratings['rating'] > 5
print(filter_1)
filter_1.any()

In [0]:
filter_2 = ratings['rating'] > 0
filter_2.all()

<h1 style="font-size:2em;color:#2467C0">Data Cleaning: Handling Missing Data</h1>

In [0]:
movies.shape

In [0]:
# Does any column contain NULL (missing) values?

movies.isnull().any()

That's nice! No NULL values!

In [0]:
ratings.shape

In [0]:
# Does any column contain NULL (missing) values?

ratings.isnull().any()

That's nice! No NULL values!

In [0]:
tags.shape

In [0]:
# Does any column contain NULL (missing) values?

tags.isnull().any()

We have some tags which are NULL.

In [0]:
tags = tags.dropna()

In [0]:
# Check again: does any column still contain NULL values?

tags.isnull().any()

In [0]:
tags.shape

That's nice! No NULL values! Notice that the number of rows has decreased.
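
The checks used in this section can be tried on a tiny frame. This is a sketch on made-up data (the frame `toy` and the placeholder `'unknown'` are assumptions for illustration): it contrasts the per-column and per-row null checks, and shows `fillna` as an alternative to dropping rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'userId': [1, 2, 3],
    'tag': ['funny', np.nan, 'dark'],
})

print(toy.isnull().any())         # per column: does it contain a missing value?
print(toy.isnull().any(axis=1))   # per row: is any field missing?

dropped = toy.dropna()                    # remove rows with missing values
filled = toy.fillna({'tag': 'unknown'})   # or keep the rows and fill a placeholder

print(dropped.shape, filled.shape)
```

Whether to drop or fill depends on the analysis; dropping is reasonable here because a tag row without a tag carries no information.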

<h1 style="font-size:2em;color:#2467C0">Data Visualization</h1>

In [0]:
%matplotlib inline

ratings.hist(column='rating', figsize=(15,10))

In [0]:
ratings.boxplot(column='rating', figsize=(15,20))

<h1 style="font-size:2em;color:#2467C0">Slicing Out Columns</h1>
 

In [0]:
tags['tag'].head()

In [0]:
movies[['title','genres']].head()

In [0]:
ratings[-10:]

In [0]:
tag_counts = tags['tag'].value_counts()
tag_counts[-10:]

In [0]:
tag_counts[:10].plot(kind='bar', figsize=(15,10))

<h1 style="font-size:2em;color:#2467C0">Filters for Selecting Rows</h1>

In [0]:
is_highly_rated = ratings['rating'] >= 4.0

ratings[is_highly_rated][30:50]

In [0]:
is_animation = movies['genres'].str.contains('Animation')

movies[is_animation][5:15]

In [0]:
movies[is_animation].head(15)

<h1 style="font-size:2em;color:#2467C0">Group By and Aggregate </h1>

In [0]:
ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count

In [0]:
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()

In [0]:
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()

In [0]:
# movie_count was computed in the previous cell; look at its last rows
movie_count.tail()

<h1 style="font-size:2em;color:#2467C0">Merge Dataframes</h1>

In [0]:
tags.head()

In [0]:
movies.head()

In [0]:
t = movies.merge(tags, on='movieId', how='inner')
t.head()

More examples: http://pandas.pydata.org/pandas-docs/stable/merging.html
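
The `how` argument decides which keys survive the merge. A short sketch on made-up frames (`films` and `labels` are illustrative names) compares an inner join with a left join:

```python
import pandas as pd

films = pd.DataFrame({'movieId': [1, 2, 3],
                      'title': ['A', 'B', 'C']})
labels = pd.DataFrame({'movieId': [1, 1, 3, 4],
                       'tag': ['funny', 'classic', 'dark', 'orphan']})

# inner: keep only movieIds present in both frames
inner = films.merge(labels, on='movieId', how='inner')

# left: keep every row of films; movies without tags get NaN
left = films.merge(labels, on='movieId', how='left')

print(inner)
print(left)
```

A movie with several tags appears once per tag in the result, so a merge can return more rows than either input has unique keys.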

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>


Combine aggregation, merging, and filters to get useful analytics
</p>
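
The three steps can be sketched end to end on made-up data before applying them to the full dataset. The frames `toy_movies` and `toy_ratings` and the threshold of 4.0 are assumptions for illustration.

```python
import pandas as pd

toy_movies = pd.DataFrame({'movieId': [1, 2, 3],
                           'title': ['A', 'B', 'C'],
                           'genres': ['Comedy', 'Drama', 'Comedy|Romance']})
toy_ratings = pd.DataFrame({'movieId': [1, 1, 2, 3, 3],
                            'rating': [4.0, 5.0, 3.0, 2.0, 3.0]})

# 1. Aggregate: number of ratings and mean rating per movie
stats = (toy_ratings.groupby('movieId')['rating']
         .agg(['count', 'mean'])
         .reset_index())

# 2. Merge: attach the statistics to the movie metadata
summary = toy_movies.merge(stats, on='movieId', how='inner')

# 3. Filter: comedies with a mean rating of at least 4.0
mask = summary['genres'].str.contains('Comedy') & (summary['mean'] >= 4.0)
print(summary[mask])
```

Keeping the count next to the mean helps when judging results, since a high average based on very few ratings is weak evidence.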

In [0]:
avg_ratings = ratings.groupby('movieId', as_index=False).mean()
del avg_ratings['userId']
avg_ratings.head()

In [0]:
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.tail()

In [0]:
is_highly_rated = box_office['rating'] >= 4.0

box_office[is_highly_rated][-5:]

In [0]:
is_comedy = box_office['genres'].str.contains('Comedy')

box_office[is_comedy][:5]

In [0]:
box_office[is_comedy & is_highly_rated][-5:]

<h1 style="font-size:2em;color:#2467C0">Vectorized String Operations</h1>


In [0]:
movies.head()

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold"><br>

Split 'genres' into multiple columns

<br> </p>

In [0]:
movie_genres = movies['genres'].str.split('|', expand=True)

In [0]:
movie_genres[:10]

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold"><br>

Add a new column for comedy genre flag

<br> </p>

In [0]:
movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')

In [0]:
movie_genres[:10]

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold"><br>

Extract year from title e.g. (1995)

<br> </p>

In [0]:
# Raw string avoids invalid-escape warnings; expand=False returns a Series
movies['year'] = movies['title'].str.extract(r'.*\((.*)\).*', expand=False)

In [0]:
movies.tail()

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold"><br>

More here: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods
<br> </p>
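
The pattern used for `year` above captures whatever sits inside the last pair of parentheses. As a hedged alternative, the sketch below only accepts four digits in parentheses at the end of the title and converts the result to numbers. The sample titles are made up, and this stricter pattern is an assumption about what a "year" looks like rather than something the dataset guarantees.

```python
import pandas as pd

titles = pd.Series(['Toy Example (1995)',
                    'Foreign Title (Original Title) (2001)',
                    'No Year Here'])

# Capture exactly four digits inside parentheses at the end of the title
year_text = titles.str.extract(r'\((\d{4})\)\s*$', expand=False)

# Convert to a numeric dtype; titles without a match stay missing (NaN)
year = pd.to_numeric(year_text, errors='coerce')
print(year)
```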

<h1 style="font-size:2em;color:#2467C0">Parsing Timestamps</h1>

Timestamps are common in sensor data or other time series datasets.
Let us revisit the *tags.csv* dataset and read the timestamps!


In [0]:
tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.head()

In [0]:
tags.dtypes

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold">

Unix time / POSIX time / epoch time records 
time in seconds <br> since midnight Coordinated Universal Time (UTC) of January 1, 1970
</p>
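
A quick sanity check of this definition, as a sketch on hand-picked values: zero seconds should map to the start of 1970, and 86400 seconds (one day) to the following midnight. The Series `epoch_seconds` is illustrative.

```python
import pandas as pd

epoch_seconds = pd.Series([0, 86400, 1_000_000_000])

# unit='s' tells pandas the integers are seconds since the epoch
parsed = pd.to_datetime(epoch_seconds, unit='s')
print(parsed)

# Date components are available through the .dt accessor
print(parsed.dt.year.tolist())
```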

In [0]:
tags.head(5)

In [0]:
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold">

Data Type datetime64[ns] maps to either <M8[ns] or >M8[ns] depending on the byte order (endianness) of the hardware

</p>

In [0]:

tags['parsed_time'].dtype

In [0]:
tags.head(2)

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold">

Selecting rows based on timestamps
</p>
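
Once a column holds datetimes, rows can be selected either by comparing against a date string or by a date component. A minimal sketch on made-up data (the frame `events` is illustrative):

```python
import pandas as pd

events = pd.DataFrame({
    'tag': ['a', 'b', 'c'],
    'parsed_time': pd.to_datetime(['2014-12-31', '2015-02-01', '2015-06-15']),
})

# Comparison against a date string: strictly later than 2015-02-01 00:00
after = events[events['parsed_time'] > '2015-02-01']

# Selection by component: everything that happened in 2015
in_2015 = events[events['parsed_time'].dt.year == 2015]

print(after['tag'].tolist(), in_2015['tag'].tolist())
```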

In [0]:
greater_than_t = tags['parsed_time'] > '2015-02-01'

selected_rows = tags[greater_than_t]

tags.shape, selected_rows.shape

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold">

Sorting the table using the timestamps
</p>

In [0]:
tags.sort_values(by='parsed_time', ascending=True)[:10]

<h1 style="font-size:2em;color:#2467C0">Average Movie Ratings over Time </h1>

## Are movie ratings related to the year of launch?

In [0]:
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()

In [0]:
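joined = movies.merge(average_rating, on='movieId', how='inner')
print(joined.head())
# Correlate only the numeric columns; title, genres, and year are strings
joined[['movieId', 'rating']].corr()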

In [0]:
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()
yearly_average[:10]

In [0]:
yearly_average[-20:].plot(x='year', y='rating', figsize=(15,10), grid=True)

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold">

Do some years look better for the box office movies than others? <br><br>

Does any data point seem like an outlier in some sense?

</p>
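
One way to probe the first question numerically is sketched below on made-up numbers. The `year` values extracted from titles are strings, so they need converting before a correlation can be computed. The frame `toy` and its values are illustrative only and say nothing about the actual MovieLens result.

```python
import pandas as pd

toy = pd.DataFrame({'year': ['1990', '2000', '2010', 'unknown'],
                    'rating': [3.0, 3.5, 4.0, 2.0]})

# Entries that are not numbers become NaN and are skipped by corr()
toy['year_num'] = pd.to_numeric(toy['year'], errors='coerce')

correlation = toy['year_num'].corr(toy['rating'])
print(correlation)
```

A correlation computed this way only measures a linear trend; the plot above remains the better tool for spotting individual outlier years.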