<p style="font-family: Arial; font-size:3.75em;color:purple; font-style:bold"><br>
Pandas</p><br>

*pandas* is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python. 

*pandas* builds upon *numpy*, providing easy-to-use data structures and data manipulation functions with integrated indexing.

The main data structures *pandas* provides are *Series* and *DataFrames*. After a brief introduction to these two data structures and data ingestion, the key features of *pandas* this notebook covers are:
* Generating descriptive statistics on data
* Data cleaning using built-in pandas functions
* Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data
* Merging multiple datasets using dataframes
* Working with timestamps and time-series data

**Additional Recommended Resources:**
* *pandas* Documentation: http://pandas.pydata.org/pandas-docs/stable/
* *Python for Data Analysis* by Wes McKinney
* *Python Data Science Handbook* by Jake VanderPlas

Let's get started with our first *pandas* notebook!

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

Import Libraries
</p>

In [None]:
import pandas as pd

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Introduction to pandas Data Structures</p>
<br>
*pandas* has two main data structures: *Series* and *DataFrames*.

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
pandas Series</p>

*pandas Series* is a one-dimensional labeled array. 


In [None]:
ser = pd.Series(data=[100, 200, 300, 400, 500], index=['tom', 'bob', 'nancy', 'dan', 'eric'])

In [None]:
ser

In [None]:
ser.index

In [None]:
ser['nancy']

In [None]:
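# select by integer position; .iloc keeps this working on recent pandas versions,
# where plain [] with integers on a label-indexed Series is deprecated
ser.iloc[[4, 3, 1]]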

In [None]:
'bob' in ser

In [None]:
ser

In [None]:
ser * 2

In [None]:
ser ** 2
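
A Series can be addressed either by its index labels or by integer position. The sketch below makes that choice explicit with `.loc` and `.iloc`; the name `ser_demo` is only a stand-in for the toy Series used in this section, rebuilt so the cell runs on its own.

```python
import pandas as pd

# Same toy data as the Series above, under a separate (hypothetical) name.
ser_demo = pd.Series(data=[100, 200, 300, 400, 500],
                     index=['tom', 'bob', 'nancy', 'dan', 'eric'])

by_label = ser_demo.loc[['eric', 'dan', 'bob']]   # select by index label
by_position = ser_demo.iloc[[4, 3, 1]]            # select by integer position

# Both selections pick the same three entries, in the same order.
print(by_label.equals(by_position))
```

Spelling out `.loc` or `.iloc` avoids ambiguity when the index itself contains integers.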

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
pandas DataFrame</p>

*pandas DataFrame* is a 2-dimensional labeled data structure.

<p style="font-family: Arial; font-size:1.25em;color:#2462C0; font-style:bold">
Create DataFrame from dictionary of Python Series</p>

In [None]:
d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}

In [None]:
df = pd.DataFrame(d)
print(df)

In [None]:
df.index

In [None]:
df.columns

In [None]:
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])

In [None]:
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])
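
When a DataFrame is built from a dictionary of Series, the row index is the union of the Series indexes, and any label missing from one Series shows up as NaN in that column. A minimal sketch with made-up data (`d_small` and its labels are assumptions for illustration):

```python
import pandas as pd

# Two Series that share only the label 'b'.
d_small = {
    'one': pd.Series([1.0, 2.0], index=['a', 'b']),
    'two': pd.Series([10.0, 20.0], index=['b', 'c']),
}

aligned = pd.DataFrame(d_small)

# Count the NaN cells that alignment introduced in each column.
missing_per_column = aligned.isnull().sum()

print(aligned)
print(missing_per_column)
```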

<p style="font-family: Arial; font-size:1.25em;color:#2462C0; font-style:bold">
Create DataFrame from list of Python dictionaries</p>

In [None]:
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]

In [None]:
pd.DataFrame(data)

In [None]:
pd.DataFrame(data, index=['orange', 'red'])

In [None]:
pd.DataFrame(data, columns=['joe', 'dora','alice'])

<p style="font-family: Arial; font-size:1.25em;color:#2462C0; font-style:bold">
Basic DataFrame operations</p>

In [None]:
df

In [None]:
# indexing by a column returns that column as a Series

df['one']

In [None]:
# similar to dictionaries, when indexing DataFrames by a new key (or column), a new column is created in the DataFrame

df['three'] = df['one'] * df['two']
df

In [None]:
df['flag'] = df['one'] > 250
df

In [None]:
# .pop method removes the column and returns it as a pandas Series - can store in a variable

three = df.pop('three')

In [None]:
three

In [None]:
df

In [None]:
# del deletes specified column without returning anything

del df['two']

In [None]:
df

In [None]:
# can insert new columns by specified location using .insert method. arguments: (loc, column name, value)

df.insert(2, 'copy_of_one', df['one'])
df

In [None]:
# when copying new columns in a DataFrame, you can also specify how much data you want from the source copied by slicing
# the data using indexing. values not copied show as NaN

df['one_upper_half'] = df['one'][:2]
df
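
The NaN in the last cell comes from index alignment: when a Series is assigned to a column, pandas matches values by row label, not by position. A small sketch on a hypothetical frame (`frame` and its column names are made up for this example):

```python
import pandas as pd

frame = pd.DataFrame({'one': [100.0, 200.0, 300.0]},
                     index=['apple', 'ball', 'clock'])

# Only the first two labels are present in the source, so 'clock' gets NaN.
frame['first_two'] = frame['one'].iloc[:2]

# The source is in reverse order, but values are matched by label,
# so the new column ends up in the same order as 'one'.
frame['reversed_source'] = frame['one'].iloc[::-1]

print(frame)
```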

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Case Study: Movie Data Analysis</p>
<br>This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore it using *pandas*. 

## Download the Dataset

### Please note that **you will need to download the dataset**. 

Although the video for this notebook says that the data is in your folder, the folder turned out to be too large to fit on the edX platform due to size constraints.

Here are the links to the data source and location:
* **Data Source:** MovieLens web site (filename: ml-20m.zip)
* **Location:** https://grouplens.org/datasets/movielens/

Once the download completes, please make sure the data files are in a directory called **movielens** in your **Week-4-pandas** folder. 

Let us look at the files in this dataset using the UNIX command ls.


In [None]:
# Note: Adjust the name of the folder to match your local directory

!ls ./movielens

In [None]:
!cat ./movielens/movies.csv

# Note the csv has labels for the data columns (movieId,title,genres) in its first line - if these weren't there, we would
# pass header=None to read_csv and the columns would be labelled numerically

In [None]:
# We can check the line count of the csv file by piping into the wc filter and specifying -l for lines. The count includes
# the header line, so the number of movies in our movie database is the line count minus 1

!cat ./movielens/movies.csv | wc -l
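
The relationship between the line count of a CSV file and the number of data rows can be checked in Python as well. This is a sketch on a tiny in-memory CSV; the text in `csv_text` is made up and only mimics the layout of movies.csv.

```python
import io

import pandas as pd

# A made-up three-line CSV: one header line followed by two data rows.
csv_text = (
    "movieId,title,genres\n"
    "1,Toy Story (1995),Animation\n"
    "2,Jumanji (1995),Adventure\n"
)

line_count = len(csv_text.splitlines())      # what `wc -l` would report
toy_csv = pd.read_csv(io.StringIO(csv_text))  # the header becomes the column labels

print(line_count, len(toy_csv))
```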

In [None]:
# to check quickly if the other csv files contain data and to see how it's formatted we can use the unix head command

!head -n5 ./movielens/links.csv
print()
!head -n5 ./movielens/genome-scores.csv
print()
!head -n5 ./movielens/genome-tags.csv
print()
!head -n5 ./movielens/ratings.csv
print()
!head -n5 ./movielens/tags.csv

In [None]:
# Now that we know the data is good - we can start loading it in to DataFrames

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Use Pandas to Read the Dataset<br>
</p>
<br>
In this notebook, we will be using three CSV files:
* **ratings.csv :** *userId*,*movieId*,*rating*, *timestamp*
* **tags.csv :** *userId*,*movieId*, *tag*, *timestamp*
* **movies.csv :** *movieId*, *title*, *genres* <br>

Using the *read_csv* function in pandas, we will ingest these three files.

In [None]:
# below, movies is a pandas DataFrame object. We call the pd.read_csv function, passing in the file path and the separator

movies = pd.read_csv('./movielens/movies.csv', sep=',')
print(type(movies))

# .head is a DataFrame method that displays the first 5 rows by default. We can pass in a different number to get back
# a different number of rows of data, e.g. .head(15)
movies.head()

In [None]:
# Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970

tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.head()

In [None]:
# Read the section "Parsing Timestamps" further below to learn more about timestamps

ratings = pd.read_csv('./movielens/ratings.csv', sep=',', parse_dates=['timestamp'])
ratings.head()

In [None]:
# For current analysis, we will remove timestamp (we will come back to it!)

del ratings['timestamp']
del tags['timestamp']

<h1 style="font-size:2em;color:#2467C0">Data Structures </h1>

<h1 style="font-size:1.5em;color:#2467C0">Series</h1>

In [None]:
# Extract the 0th row: notice that it is in fact a Series
# the .iloc indexer extracts a row of a DataFrame by integer position - a single row is returned as a Series

row_0 = tags.iloc[0]
type(row_0)

In [None]:
print(row_0)

In [None]:
# the Series method .index  returns the index labels for the series

row_0.index

In [None]:
# indexing the series by the index label returns the value for that label

row_0['userId']

In [None]:
# rating is not an index label for the Series, so below returns False

'rating' in row_0

In [None]:
# the .name attribute returns the name of the Series (here, the row label) - it can be changed using the .rename method as below

row_0.name

In [None]:
row_0 = row_0.rename('first_row')
row_0.name

<h1 style="font-size:1.5em;color:#2467C0">DataFrames </h1>

In [None]:
tags.head()

In [None]:
# .index in a DataFrame returns information on the rows (index data)

tags.index

In [None]:
# the .columns attribute returns the column labels

tags.columns

In [None]:
# below extracts rows 0, 11 and 2000 from the DataFrame by passing a list of positions to the .iloc indexer - returns a
# DataFrame containing those rows

tags.iloc[ [0,11,2000] ]

<h1 style="font-size:2em;color:#2467C0">Descriptive Statistics</h1>

Let's look at how the ratings are distributed!

In [None]:
# .describe method shows summary statistics of the DataFrame, e.g. mean, s.d. etc.
# can be used to help figure out if something's wrong with your data - e.g. if max or min values out of the expected range

ratings['rating'].describe()

In [None]:
ratings['rating'].mean()

In [None]:
ratings.mean()

In [None]:
ratings['rating'].min()

In [None]:
ratings['rating'].max()

In [None]:
ratings['rating'].std()

In [None]:
ratings['rating'].mode()

In [None]:
# .corr method can be used to show correlation between the data using the pairwise Pearson coefficient

ratings.corr()

In [None]:
# below creates a boolean Series that is True where the rating is > 5, and assigns it to a filter variable
# we use the .any method to check whether any value in it is True -> it returns False, which means our ratings data is
# in line with what we expect

# The pandas any() method is applicable to both Series and DataFrame. It returns True if at least one value in the
# caller object is True (or truthy, e.g. non-zero). If every value is False (or zero), it returns False

filter_1 = ratings['rating'] > 5
print(type(filter_1))
filter_1.any()

In [None]:
# the .all method can be used on a Series or DataFrame as well. It returns whether all elements are True, potentially over an
# axis. It returns True unless there is at least one element within a Series or along a DataFrame axis that is False or
# equivalent (e.g. zero or empty)

filter_2 = ratings['rating'] > 0
filter_2.all()
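
The same sanity checks can be tried on a handful of values. Because True counts as 1, calling `.sum()` on a boolean Series also tells us how many values satisfy a condition. The data in `toy_ratings` is made up for illustration.

```python
import pandas as pd

toy_ratings = pd.Series([0.5, 3.0, 4.5, 5.0])

above_max = toy_ratings > 5   # boolean Series: is any rating out of range?
positive = toy_ratings > 0    # boolean Series: is every rating positive?

print(above_max.any())             # is at least one value True?
print(positive.all())              # are all values True?
print((toy_ratings >= 4).sum())    # how many ratings are 4 or higher?
```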

<h1 style="font-size:2em;color:#2467C0">Data Cleaning: Handling Missing Data</h1>

In [None]:
# the .shape attribute yields the number of rows and columns in the DataFrame as a tuple

movies.shape

In [None]:
# .isnull method detects missing values for an array-like object.

# This function takes a scalar or array-like object and indicates whether values are missing (NaN in numeric arrays, None 
# or NaN in object arrays, NaT in datetimelike). Returns a bool array per each element of data indicating True or False

# .any is then run to check if any True values are contained in each column -> all False indicates no null values

movies.isnull().any()

That's nice! No NULL values!

In [None]:
ratings.shape

In [None]:
# does any column contain NULL values?

ratings.isnull().any()

That's nice! No NULL values!

In [None]:
tags.shape

In [None]:
# does any column contain NULL values?

# .isnull() returns some True values in the bool array - and thus .any() returns True for the column in which the True
# values were returned (column 'tag' in this case)

tags.isnull().any()

We have some tags which are NULL.

In [None]:
# we can then use the .dropna() method to drop the rows which contain null values

tags = tags.dropna()

In [None]:
# checking .isnull() again confirms that the null values have been removed

tags.isnull().any()

In [None]:
# we can then check the .shape attribute again to compare how many rows remain compared to the original shape. We can see
# that only 16 rows have been removed

tags.shape
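
Rather than comparing shapes by eye, the number of missing values per column and the number of dropped rows can be computed directly. A sketch on made-up data (`toy_tags` is hypothetical and only mirrors the column names of the tags table):

```python
import numpy as np
import pandas as pd

toy_tags = pd.DataFrame({
    'userId': [1, 2, 3, 4],
    'tag': ['funny', np.nan, 'dark', None],
})

# True counts as 1, so summing the boolean frame counts nulls per column.
null_counts = toy_tags.isnull().sum()

# Drop only the rows where 'tag' is missing.
cleaned = toy_tags.dropna(subset=['tag'])
rows_removed = len(toy_tags) - len(cleaned)

print(null_counts)
print(rows_removed)
```

Passing `subset` restricts the check to the named columns, which helps when other columns are allowed to contain gaps.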

That's nice! No NULL values! Notice the number of rows has decreased.

<h1 style="font-size:2em;color:#2467C0">Data Visualization</h1>

In [None]:
# Pandas leverages matplotlib underneath for its plotting capabilities

# if you want Jupyter to plot the graphs *inside the notebook*, you have to tell Jupyter to plot inline as we see here
# (the % symbol before matplotlib below is used for special class functions in Jupyter called 'magic functions')

%matplotlib inline

# .hist method returns a histogram. we need to pass in the data, and figsize to adjust the graph size- can also take other
# arguments, e.g. bin size etc.

ratings.hist(column='rating', figsize=(15,10))

In [None]:
# .boxplot method takes arguments similar to the .hist method except returns a boxplot rather than histogram

ratings.boxplot(column='rating', figsize=(12,8))

<h1 style="font-size:2em;color:#2467C0">Slicing Out Columns</h1>
 

In [None]:
tags['tag'].head()

In [None]:
# selecting a list of columns returns a DataFrame, so the .head method works on multiple columns as below

movies[['title','genres']].head()

In [None]:
# slicing rows out of the middle of a DataFrame can be done as below - returns all columns if not specified

ratings[1000:1010]

In [None]:
# alternatively we can grab from the end as below, etc.

ratings[-10:]

In [None]:
# the .value_counts method counts how many rows hold each unique value of a column (e.g. 'tag'), sorted from most to
# least frequent. this can be assigned to a variable, e.g. 'tag_counts' as below

tag_counts = tags['tag'].value_counts()
tag_counts[:10]

In [None]:
tag_counts[-10:]

In [None]:
# the assigned variable can subsequently be plotted as below using the plot function. argument 'kind' specifies
# the type of plot you want

tag_counts[:10].plot(kind='bar', figsize=(12,9))

<h1 style="font-size:2em;color:#2467C0">Filters for Selecting Rows</h1>

In [None]:
# Filtering is a common operation for selecting data that matches some criteria. It is done in 2 steps:
# First we need to develop a filter that encodes our criteria (e.g. into a boolean array)
# Then we apply that filter as a mask to our DataFrame:

# is_highly_rated is the filter which returns a boolean array given the specified criteria
is_highly_rated = ratings['rating'] >= 4.0

# we then apply the filter to our original DataFrame as a mask, which returns a new DataFrame of the filtered data
ratings[is_highly_rated][-5:]

In [None]:
# below uses the .str.contains method to filter by string - including only genres that contain the word 'Animation'
# (the match is case-sensitive by default). again, a boolean array is first created with the filter and assigned to
# variable is_animation. the filter is then applied (by indexing) as a mask to the original DataFrame - and it
# produces a new DataFrame with the filtered data

is_animation = movies['genres'].str.contains('Animation')

movies[is_animation][5:15]

In [None]:
movies[is_animation].head(15)
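
Two details of `.str.contains` are worth trying on a small example: matching is case-sensitive unless `case=False` is passed, and missing values produce missing results unless `na` is set. The frame `toy_movies` below is made up for illustration.

```python
import pandas as pd

toy_movies = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'genres': ['Animation|Comedy', 'Drama', 'animation', None],
})

# Case-sensitive match; rows with a missing genre are treated as False.
strict = toy_movies['genres'].str.contains('Animation', na=False)

# Case-insensitive match picks up the lower-case spelling as well.
relaxed = toy_movies['genres'].str.contains('animation', case=False, na=False)

print(toy_movies[strict])
print(toy_movies[relaxed])
```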

<h1 style="font-size:2em;color:#2467C0">Group By and Aggregate </h1>

In [None]:
# Aggregating values across rows gives us a big picture about the whole dataset - so it can be very useful.
# we aggregate using the .groupby method.
# Once we have performed groupby, and assigned to a variable, we can use other functions such as count or mean to get
# the exact statistics we are looking for on the grouped data

# the argument in .groupby() is what the grouped data will be indexed by. As an example - below uses the count function
# on data grouped by rating to work out the number of ratings given at each rating value
ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count

In [None]:
# below groups (indexes) by movieId and calculates the average rating per movie. Note how 'movieId' and 'rating' are first
# selected from the ratings DataFrame

average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.tail()

In [None]:
average_rating.head()

In [None]:
# below uses count on the grouped data to work out the number of ratings in our database per movie
# as it's grouped by 'movieId', this is what it's indexed by

movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()

In [None]:
# using .tail we see that some movies only have 1 rating each
movie_count.tail()
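
Since an average based on a single rating is not very informative, it can help to compute the count and the mean together and then keep only movies with enough ratings. This is a sketch on made-up data; `toy_grouped` and the threshold of 2 are assumptions for illustration.

```python
import pandas as pd

toy_grouped = pd.DataFrame({
    'movieId': [1, 1, 1, 2, 2, 3],
    'rating': [4.0, 5.0, 3.0, 2.0, 4.0, 5.0],
})

# .agg with a list of function names computes several statistics in one pass.
summary = toy_grouped.groupby('movieId')['rating'].agg(['count', 'mean'])

# Keep only movies that have at least 2 ratings.
reliable = summary[summary['count'] >= 2]

print(summary)
print(reliable)
```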

<h1 style="font-size:2em;color:#2467C0">Merge Dataframes</h1>

In [None]:
tags.head()

In [None]:
movies.head()

Note that the movies and tags dataframes both have the common column 'movieId' and thus can be inner merged by it

In [None]:
# Often when working with DataFrames we need data from multiple frames. A common practice is to merge the data we
# want from several frames into a single frame, and then execute operations on the new frame. This is similar to
# the JOIN operation in SQL.

# the pd.concat function stacks DataFrames vertically by default (axis=0) and creates a new DataFrame out of them.
# If a DataFrame is stacked on to itself, the row indexes from the original table are preserved (and so repeated).
# If the 2 DataFrames given to concat have different columns, the resulting DataFrame has all columns from both
# frames -> cells for the columns that didn't exist in an original DataFrame are filled with NaN.

# passing join='inner' to pd.concat keeps only the labels shared by both frames. To place the 2 DataFrames next
# to each other horizontally, pass axis=1. This is good except that columns can be duplicated - e.g. the key.

# older pandas also had df1.append(df2) for vertical stacking; it was removed in pandas 2.0, so use
# pd.concat([df1, df2]) instead

# To avoid duplicate columns and NaN values, the solution is to use the .merge method. .merge matches rows on the
# values of a key column (rather than on the index) and keeps a single copy of that key column. This is very
# useful when combining data with the same keys -> similar to working with relational databases

# .merge is demonstrated below, joining on 'movieId', specified as an inner join -> Notice how we merge both the tags 
# data and movies data in to one Dataframe
t = movies.merge(tags, on='movieId', how='inner')
t.head()
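
The difference between stacking and merging is easiest to see on two tiny tables. The frames `left` and `right` below are made up; only `pd.concat` and `DataFrame.merge` are real pandas calls.

```python
import pandas as pd

left = pd.DataFrame({'movieId': [1, 2, 3], 'title': ['A', 'B', 'C']})
right = pd.DataFrame({'movieId': [1, 1, 4], 'tag': ['x', 'y', 'z']})

# Vertical stacking: all rows of both frames, NaN where a column is absent.
stacked = pd.concat([left, right], ignore_index=True)

# Inner merge: only movieId values present in both frames (movieId 1, twice).
inner = left.merge(right, on='movieId', how='inner')

# Left merge: every row of `left`, with NaN tags for movies that have none.
left_join = left.merge(right, on='movieId', how='left')

print(stacked.shape, len(inner), len(left_join))
```

An inner merge silently drops movies without tags, so a left merge can be the safer choice when every movie has to stay in the result.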

More examples: http://pandas.pydata.org/pandas-docs/stable/merging.html

<h1 style="font-family: Arial; font-size:10; color:#2462C0; font-style:bold">Combine aggregation, merging, and filters to get useful analytics</h1>

In [None]:
ratings.head()

In [None]:
# By setting 'as_index' argument in the .groupby method to False, we ensure that 'movieId' is not the index -> a numerical
# index is created to the left of movieId

# .mean() is performed on the rating data, and returned per movieId which are grouped in to a single row

avg_ratings = ratings.groupby('movieId', as_index=False).mean()
del avg_ratings['userId']
avg_ratings.head()

In [None]:
movies.head()

In [None]:
# we can now merge these average ratings (stored in variable avg_ratings) above in to the movies table. As both Dataframes
# have the movieId column, this is what we merge on

box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.head()

In [None]:
# we can now apply filters on our merged set of data. remember this is done by creating a boolean array as the filter,
# and then apply this as a mask (by indexing) on to the main Dataframe. Our filter below is assigned to variable 
# 'is_highly_rated' and sets True in the boolean array to movies with a rating >= 4.0

is_highly_rated = box_office['rating'] >= 4.0

box_office[is_highly_rated][-5:]

In [None]:
# similarly, we can use the .str.contains method to filter by a given string. Again, this returns a boolean array
# as a filter which is masked on to the main DataFrame

is_comedy = box_office['genres'].str.contains('Comedy')
box_office[is_comedy][:5]

In [None]:
# finally, we can apply multiple filters to our data at the same time. Only rows for which both boolean
# arrays are True will be returned below, as we are using the & (element-wise AND) operator

# the filter essentially only returns Comedy movies with a high rating

box_office[is_comedy & is_highly_rated][-5:]
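
Besides `&`, boolean masks can be combined with `|` (or) and `~` (not). When the comparisons are written inline, each one needs its own parentheses because these operators bind more tightly than `>=`. A sketch on a made-up table (`toy_box` is hypothetical):

```python
import pandas as pd

toy_box = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'genres': ['Comedy', 'Comedy|Drama', 'Drama', 'Horror'],
    'rating': [4.2, 3.1, 4.5, 2.0],
})

toy_is_comedy = toy_box['genres'].str.contains('Comedy')
toy_is_high = toy_box['rating'] >= 4.0

both = toy_box[toy_is_comedy & toy_is_high]        # comedy AND highly rated
either = toy_box[toy_is_comedy | toy_is_high]      # comedy OR highly rated
neither = toy_box[~(toy_is_comedy | toy_is_high)]  # everything else

# Inline form: note the parentheses around the comparison.
inline = toy_box[(toy_box['rating'] >= 4.0) & toy_box['genres'].str.contains('Comedy')]

print(both['title'].tolist(), either['title'].tolist(), neither['title'].tolist())
```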

<h1 style="font-size:2em;color:#2467C0">Vectorized String Operations</h1>


In [None]:
movies.head()

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold"><br>
Split 'genres' into multiple columns
<br> </p>

In [None]:
# the .str.split method splits each string by the given delimiter and returns a list of strings per row,
# rather than just the string itself

# if you wish to create a new column in the DataFrame per split string, then you need to set the argument
# expand=True in the split method

movie_genres = movies['genres'].str.split('|', expand=True)

In [None]:
movie_genres[:10]

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold"><br>
Add a new column for comedy genre flag
<br> </p>

In [None]:
# the str.contains method provides a simple way to check if a string has a given substring in it. A boolean Series
# is returned.

# below adds column 'isComedy' to the movie_genres DataFrame, holding a boolean Series computed from a different
# DataFrame, 'movies', indicating whether each movie's genres contain 'Comedy' or not

movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')

In [None]:
movie_genres[:10]
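
If a flag is wanted for every genre rather than just Comedy, `Series.str.get_dummies` builds one indicator column per distinct value in a single call. This is an alternative sketch, not the approach used above; `toy_genres` is made-up data.

```python
import pandas as pd

toy_genres = pd.Series(['Adventure|Comedy', 'Drama', 'Comedy|Drama'])

# One column per genre; each cell marks whether the row lists that genre.
genre_flags = toy_genres.str.get_dummies(sep='|')

print(genre_flags)
```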

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold"><br>
Extract year from title e.g. (1995)
<br> </p>

In [None]:
# the str.extract method returns the first match for a regular expression (re) it finds.
# below, the year string extracted from 'title' is stored in a new column 'year' in the movies DataFrame.
# with this new year column, data could be grouped by year to gain further insights on the data

# re refresher: '\' preceding a special character matches that character literally rather than applying its usual
# meaning. e.g. below, '\(' matches a literal '(' instead of opening a capture group. The unescaped parentheses
# inside then form the capture group, ensuring only the text between the parentheses (e.g. the year) is extracted,
# not the parentheses themselves. The pattern is a raw string (r'...') so Python leaves the backslashes alone

movies['year'] = movies['title'].str.extract(r'.*\((.*)\).*', expand=True)

In [None]:
movies.tail()
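
The pattern above captures whatever sits inside the last pair of parentheses, which is not always a year. A stricter variant, shown here as a sketch on made-up titles, only accepts four digits in parentheses at the end of the title and returns NaN otherwise.

```python
import pandas as pd

toy_titles = pd.Series([
    'Toy Story (1995)',
    'City of Lost Children, The (Cité des enfants perdus, La) (1995)',
    'Untitled Film',
])

# \((\d{4})\) captures exactly four digits inside parentheses;
# \s*$ requires them to be at the end of the title.
toy_years = toy_titles.str.extract(r'\((\d{4})\)\s*$', expand=False)

print(toy_years)
```

With `expand=False` and a single capture group, the result is a Series rather than a one-column DataFrame.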

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold"><br>

**Further string methods and more info here**: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods
<br> </p>

<h1 style="font-size:2em;color:#2467C0">Parsing Timestamps</h1>

Timestamps are common in sensor data or other time series datasets.
Let us revisit the *tags.csv* dataset and read the timestamps!


In [None]:
tags = pd.read_csv('./movielens/tags.csv', sep=',')

In [None]:
# note the datatype for our timestamp column is 'int64' - this is the Unix time value

tags.dtypes

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold">

Unix time / POSIX time / epoch time records 
time in seconds <br> since midnight Coordinated Universal Time (UTC) of January 1, 1970
</p>

In [None]:
tags.head(5)

In [None]:
# using the pd.to_datetime function we are able to convert Unix time into datetime format
# (e.g. datetime64[ns] or <M8[ns]) -> this presents the date in a human readable format

# the unit argument is important here as it tells the method what the unit of the input is

# the output below is stored in a new column named 'parsed_time'
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold">

Data Type datetime64[ns] maps to either <M8[ns] or >M8[ns] depending on the hardware

</p>

In [None]:

tags['parsed_time'].dtype

In [None]:
tags.head(2)
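
The effect of the `unit` argument can be checked on a few known values: 0 seconds is the start of the epoch and 86400 seconds is exactly one day later. The Series `epoch_seconds` is made up for this sketch.

```python
import pandas as pd

epoch_seconds = pd.Series([0, 86400, 1_000_000_000])

# unit='s' tells pandas the integers are seconds since 1970-01-01 00:00:00 UTC.
parsed = pd.to_datetime(epoch_seconds, unit='s')

print(parsed)
print(parsed.dt.year)   # the .dt accessor exposes date parts such as the year
```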

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold">

Selecting rows based on timestamps
</p>

In [None]:
# now that the time is converted to datetime format, we can use it to create filters. As with
# other filters we've created, we assign the filter to a variable; it holds a boolean array.
# the boolean array is then applied as a mask to our data to return a filtered set of the
# original data according to the criteria set in our filter

# below only returns data in our Dataframe from after 1 Feb 2015
greater_than_t = tags['parsed_time'] > '2015-02-01'

selected_rows = tags[greater_than_t]

# by comparing the shape of our original data with our filtered data, we can see that only
# approx 15k/465k lines of data are from after 1 Feb 2015
tags.shape, selected_rows.shape
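
A date string such as '2015-02-01' is interpreted as midnight at the start of that day, so a strict `>` excludes only rows stamped exactly at midnight, while `>=` includes them. A sketch on made-up timestamps (`toy_times` is hypothetical):

```python
import pandas as pd

toy_times = pd.DataFrame({
    'tag': ['a', 'b', 'c'],
    'parsed_time': pd.to_datetime(['2015-01-31', '2015-02-01', '2015-03-15']),
})

strictly_after = toy_times[toy_times['parsed_time'] > '2015-02-01']
on_or_after = toy_times[toy_times['parsed_time'] >= '2015-02-01']

print(strictly_after['tag'].tolist(), on_or_after['tag'].tolist())
```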

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold">

Sorting the table using the timestamps
</p>

In [None]:
# using the .sort_values method we can then sort our timeseries data by the timestamps, either
# ascending (from the beginning of time) or descending (from the end of time going backwards)

# below displays the first 10 datapoints of our dataset - we see the original data started in 2005
tags.sort_values(by='parsed_time', ascending=True)[:10]

<h1 style="font-size:2em;color:#2467C0">Average Movie Ratings over Time</h1>

## Are Movie Ratings related to the Year of Launch?

In [None]:
# to answer this question, we first compute the average rating for each of our movies - this is
# done by using the .groupby method to group the ratings by movieId, and then taking the mean
# rating of each movie. this is then stored in a new DataFrame named average_rating

average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.head()

In [None]:
# the average_rating DataFrame is then joined with the movies DataFrame on movieId. this adds
# the extra column 'rating' to the movies data, which contains the average rating per movie

joined = movies.merge(average_rating, on='movieId', how='inner')
joined.head()
# joined.corr()

In [None]:
# to yield the average rating per year we create a new DataFrame named yearly_average, built
# from the 'year' & 'rating' columns of our joined table. this is grouped by year, and the
# mean rating is computed for each year

yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()
yearly_average[-15:]

In [None]:
# we can now plot these results below for the most recent 20 years of our data using the .plot
# method

# notice how after 2009 there is a spike in rating. this is likely an outlier or error in our
# data and requires further investigation. On further investigation - we can see that year
# "2009-" comes after 2009, and it has 1 rating of 5 stars. this is what's causing the spike.
# to fix this we need to clean the data and amend "2009-" to 2009.

yearly_average[-20:].plot(x='year', y='rating', figsize=(15,10), grid=True)
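
One possible way to do the cleaning described above is to keep only the four-digit part of each year string and convert it to a number, so that "2009-" falls into the same group as "2009". This is a sketch on made-up data, not the notebook's own fix; `toy_joined` and the column `year_clean` are assumptions.

```python
import pandas as pd

toy_joined = pd.DataFrame({
    'year': ['2008', '2009', '2009-', 'Bicicleta'],
    'rating': [3.0, 3.5, 5.0, 4.0],
})

# Pull out the first run of four digits; strings without one become NaN.
digits = toy_joined['year'].str.extract(r'(\d{4})', expand=False)
toy_joined['year_clean'] = pd.to_numeric(digits, errors='coerce')

# Drop rows without a usable year, then average the ratings per year.
valid = toy_joined.dropna(subset=['year_clean'])
toy_yearly = valid.groupby('year_clean', as_index=False)['rating'].mean()

print(toy_yearly)
```

A numeric year column also sorts and plots in true chronological order, which a column of strings does not guarantee.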

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold">
Do some years look better for the box office movies than others? <br><br>
Does any data point seem like an outlier in some sense?

</p>

Above are examples of questions we can ask when analysing the plot of our data