# Practical Data Transformation and Analysis with Pandas
### Zong-han, Xie

# Speaker

* Zong-han, Xie

* Majored in physics

* Previously a C++ developer working on simulation software for LCDs.

* Currently working at Micron Memory Taiwan, building an in-house B.I. system.

* Email: icbm0926@gmail.com

# Origin and intention of this talk
* Running complex SQL code on a SQL server puts a heavy load on the server.
* Complex analysis is hard to express in SQL; you may prefer to write it in Python.
* Intended for people who have built a B.I. system on top of a Python environment (e.g. Django, Flask, pandas).
* Also for those who want to exploit the power of Python and move their reporting services from SQL-based services (e.g. SSRS) to Python.
* This talk mainly focuses on the components related to the split-apply-combine strategy, which is the main method for transforming and analyzing tabular data.

# Outline
* Basic Data Structures: 
  - Create Pandas DataFrame
  - Read our demo data
  - Indexing in DataFrame and Pandas Series
  - SettingWithCopyWarning
* Text Handling with Pandas
  - Using the "str" accessor to handle strings, including regular expressions
* Merging and Concatenating tables
  - Concept of merging two tables (inner join, left/right join, outer join)
  - Concatenating tables
* Split-Apply-Combine strategy
  - Process Flow
  - GroupBy object
  - GroupBy.transform
  - GroupBy.apply
  - GroupBy.aggregate
* A small example

# Creating a DataFrame

In [2]:
import pandas as pd
pd.DataFrame({'Column_A': ['A1', 'A2', 'A3'], 'Column_B': ['A1', 'A2', 'A3']}
             , index=['a', 'b', 'c'])

Unnamed: 0,Column_A,Column_B
a,A1,A1
b,A2,A2
c,A3,A3


# Create a DataFrame from numpy array

In [3]:
import numpy as np
np_array = np.array(np.random.random((5,2)))
pd.DataFrame(np_array, columns=['Column_A', 'Column_B'])

Unnamed: 0,Column_A,Column_B
0,0.555838,0.041948
1,0.173801,0.113569
2,0.186822,0.660871
3,0.749326,0.321476
4,0.753274,0.836023


# Create a new column for DataFrame

In [4]:
import numpy as np
np_array = np.array(np.random.random((5,2)))
df = pd.DataFrame(np_array, columns=['Column_A', 'Column_B'])
df['column_C'] = pd.Series([1,2,3,4,5])
print(df)

   Column_A  Column_B  column_C
0  0.409892  0.576824         1
1  0.519409  0.506883         2
2  0.912697  0.265496         3
3  0.682447  0.162641         4
4  0.434585  0.999400         5
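
When the new column comes from a `pd.Series`, pandas aligns it with the DataFrame on index labels rather than on position. The following is a minimal sketch with made-up values (the column names `from_series` and `from_list` are only for illustration) showing how a label mismatch produces `NaN`:

```python
import pandas as pd

df = pd.DataFrame({"Column_A": [0.1, 0.2, 0.3]}, index=["a", "b", "c"])

# A Series is aligned on index labels, not on position.
# Label "z" does not exist in df, and label "a" has no value in the Series.
df["from_series"] = pd.Series([1, 2, 3], index=["c", "b", "z"])

# A plain list is assigned by position and must match the row count.
df["from_list"] = [10, 20, 30]

print(df)
```

Row `a` ends up with `NaN` in `from_series`, while rows `b` and `c` receive 2 and 1 respectively.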


# Create DataFrame from a csv file

In [24]:
users_df = pd.read_csv("./ml-1m/users.dat"
                    , delimiter='::'
                    , header=None
                    , names=["UserID", "Gender", "Age", "Occupation", "Zip-code"], engine='python')
users_df.loc[0:3]  # .ix was removed in pandas 1.0; .loc slices by label, end included

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,02460


In [6]:
movies_org_df = pd.read_csv("./ml-1m/movies.dat"
                            , sep='::'
                            , header=None
                            , names=["MovieID", "Title", "Genres"], engine='python')
movies_org_df.loc[0:3]

Unnamed: 0,MovieID,Title,Genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama


In [7]:
import datetime as dt
ratings_df = pd.read_csv("./ml-1m/ratings.dat"
                         , sep='::'
                         , header=None
                         , names=["UserID", "MovieID", "Rating", "Timestamp"], engine='python')
ratings_df['rating_dt'] = pd.to_datetime(ratings_df['Timestamp'],unit='s')
ratings_df.loc[0:3]

Unnamed: 0,UserID,MovieID,Rating,Timestamp,rating_dt
0,1,1193,5,978300760,2000-12-31 22:12:40
1,1,661,3,978302109,2000-12-31 22:35:09
2,1,914,3,978301968,2000-12-31 22:32:48
3,1,3408,4,978300275,2000-12-31 22:04:35


# Let's talk about DataFrame Indexing

# Remember to pass a list when you want multiple columns at once

In [8]:
users_df[['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code']]

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,02460
4,5,M,25,20,55455
5,6,F,50,9,55117
6,7,M,35,1,06810
7,8,M,25,12,11413
8,9,M,25,17,61614
9,10,F,35,1,95370


In [9]:
print(users_df['Occupation'][0:10])

0    10
1    16
2    15
3     7
4    20
5     9
6     1
7    12
8    17
9     1
Name: Occupation, dtype: int64


In [18]:
users_df['Occupation'].loc[0:10]

0     10
1     16
2     15
3      7
4     20
5      9
6      1
7     12
8     17
9      1
10     1
Name: Occupation, dtype: int64
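
The two outputs above differ by one row because a plain slice is positional and excludes the end, while a label-based slice includes the end label. A small sketch on a toy Series (not the demo data) that makes the difference explicit:

```python
import pandas as pd

s = pd.Series(range(100, 112))  # default integer index 0..11

by_position = s[0:10]     # plain slice: positional, end excluded
by_label = s.loc[0:10]    # label slice: end label 10 is included
by_iloc = s.iloc[0:10]    # explicit positional slice, end excluded

print(len(by_position), len(by_label), len(by_iloc))  # prints: 10 11 10
```

Using `.loc` for labels and `.iloc` for positions keeps the intent unambiguous, especially once the index is no longer 0..n-1.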

# Change data in DataFrame

In [28]:
part_users_df = users_df.ix[[2,4,8]]
part_users_df['Gender'][2] = 'F'
print(part_users_df['Gender'][2])

F


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


In [57]:
part_users_df = users_df.ix[[2,4,8]].copy()
part_users_df['Gender'].ix[2] = 'F'
print(part_users_df)

   UserID Gender  Age  Occupation Zip-code
2       3      F   25          15    55117
4       5      M   25          20    55455
8       9      M   25          17    61614


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


In [30]:
part_users_df = users_df.ix[[2,4,8]].copy()
part_users_df.ix['Gender', 2] = 'F'
print(part_users_df)

        UserID Gender Age  Occupation Zip-code
2          3.0      M  25        15.0    55117
4          5.0      M  25        20.0    55455
8          9.0      M  25        17.0    61614
Gender     NaN    NaN   F         NaN      NaN


# SettingWithCopyWarning and chained indexing
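
The cells above assign through two indexing steps (`part_users_df['Gender'][2] = ...`), which is what pandas calls chained indexing, and the last one passes the column label where the row label belongs. A minimal sketch of the usual remedy, assuming a small stand-in frame named `users` instead of the real `users_df`: take an explicit copy, then assign with a single `.loc[row, column]` call. Note that `.ix` was removed in pandas 1.0, so `.loc` is used here.

```python
import pandas as pd

# Stand-in for users_df with the same row labels as the example above.
users = pd.DataFrame(
    {"UserID": [3, 5, 9], "Gender": ["M", "M", "M"], "Age": [25, 25, 25]},
    index=[2, 4, 8],
)

part = users.loc[[2, 4, 8]].copy()  # explicit copy, independent of `users`
part.loc[2, "Gender"] = "F"         # one indexing call: row label first, then column

print(part)
print(users.loc[2, "Gender"])       # the original frame is unchanged
```

Because the assignment is a single call on an explicit copy, no warning is raised and no extra row or column is created.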


# Text Handling

In [18]:
movies_org_df.loc[0:10]

Unnamed: 0,MovieID,Title,Genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


# Split Genres into one row per genre

In [33]:
rows = []
for _, row in movies_org_df.iterrows():
    for gen in row.Genres.split('|'):
        rows.append([row['MovieID'], row['Title'], gen])
movies_df = pd.DataFrame(rows, columns=movies_org_df.columns)
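
The same reshaping can probably be done without an explicit Python loop by using the `str` accessor. The sketch below works on a two-row stand-in for `movies_org_df` and assumes pandas 0.25 or later for `DataFrame.explode`. The `Year` column is an extra illustration of using a regular expression with `str.extract`; it is not part of the original data.

```python
import pandas as pd

movies = pd.DataFrame({
    "MovieID": [1, 2],
    "Title": ["Toy Story (1995)", "Jumanji (1995)"],
    "Genres": ["Animation|Children's|Comedy", "Adventure|Children's|Fantasy"],
})

# Split each Genres string on "|" into a list, then emit one row per list item.
genres = movies["Genres"].str.split("|")
long_df = movies.assign(Genres=genres).explode("Genres").reset_index(drop=True)

# Regular expression with one capture group: the four-digit year in parentheses.
long_df["Year"] = long_df["Title"].str.extract(r"\((\d{4})\)", expand=False).astype(int)

print(long_df)
```

On the full MovieLens table the vectorized version avoids `iterrows()`, which tends to be slow on large frames.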

# Merge Data and Concat Data

* pandas.merge and pandas.concat

* pandas.merge is analogous to a join in SQL (inner, left, right, outer join)

* pandas.concat is analogous to "union all" in SQL.
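
To make the SQL analogy concrete, here is a small sketch on two toy tables (`left` and `right` are illustrative names, not part of the demo data) showing how the `how` argument changes which keys survive, and how `concat` stacks rows:

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "l_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "r_val": ["x", "y", "z"]})

inner = left.merge(right, on="key", how="inner")     # keys present in both: 2, 3
left_join = left.merge(right, on="key", how="left")  # all left keys: 1, 2, 3
outer = left.merge(right, on="key", how="outer")     # union of keys: 1, 2, 3, 4

# concat stacks rows like UNION ALL: duplicates are kept.
stacked = pd.concat([left, left], ignore_index=True)

print(inner, left_join, outer, stacked, sep="\n\n")
```

Rows without a match on the other side are filled with `NaN`, as in a SQL outer join.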

# Merge the demo data together

In [36]:
merged_df = ratings_df.merge(users_df
                             , on=['UserID']
                             , how='left') \
                      .merge(movies_df
                             , on=['MovieID']
                             , how='left')
merged_df.loc[0:10]

Unnamed: 0,UserID,MovieID,Rating,Timestamp,rating_dt,Gender,Age,Occupation,Zip-code,Title,Genres
0,1,1193,5,978300760,2000-12-31 22:12:40,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,1,661,3,978302109,2000-12-31 22:35:09,F,1,10,48067,James and the Giant Peach (1996),Animation
2,1,661,3,978302109,2000-12-31 22:35:09,F,1,10,48067,James and the Giant Peach (1996),Children's
3,1,661,3,978302109,2000-12-31 22:35:09,F,1,10,48067,James and the Giant Peach (1996),Musical
4,1,914,3,978301968,2000-12-31 22:32:48,F,1,10,48067,My Fair Lady (1964),Musical
5,1,914,3,978301968,2000-12-31 22:32:48,F,1,10,48067,My Fair Lady (1964),Romance
6,1,3408,4,978300275,2000-12-31 22:04:35,F,1,10,48067,Erin Brockovich (2000),Drama
7,1,2355,5,978824291,2001-01-06 23:38:11,F,1,10,48067,"Bug's Life, A (1998)",Animation
8,1,2355,5,978824291,2001-01-06 23:38:11,F,1,10,48067,"Bug's Life, A (1998)",Children's
9,1,2355,5,978824291,2001-01-06 23:38:11,F,1,10,48067,"Bug's Life, A (1998)",Comedy


# Split - Apply - Combine Strategy

The basic concept of the split-apply-combine strategy:

* Split
    - Split original data into groups

* Apply
    - Apply functions to data within each group independently

* Combine
    - Merge results into a data structure

# Data Aggregation

```
select 
max(Ratings.Rating)
from Ratings
group by Ratings.MovieId
```

```
ratings_df.assign(max_ratings = ratings_df.groupby("MovieID")['Rating'].transform(np.max))
```

* Data aggregation in pandas uses GroupBy.apply, GroupBy.transform and GroupBy.aggregate.
* These functions are poorly documented in the pandas documentation.
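
The SQL query and the pandas snippet above are not exactly the same shape: `GROUP BY` yields one row per group, whereas `transform` yields one value per original row. A minimal sketch on a toy `ratings` frame (made-up values) contrasting the two:

```python
import pandas as pd

ratings = pd.DataFrame({
    "MovieID": [1, 1, 2, 2, 2],
    "Rating": [3, 5, 4, 2, 4],
})

# aggregate: one row per group, like
#   SELECT MovieID, MAX(Rating) FROM Ratings GROUP BY MovieID
max_per_movie = ratings.groupby("MovieID")["Rating"].agg("max").reset_index()

# transform: one value per original row, closer to a SQL window function
with_max = ratings.assign(
    max_rating=ratings.groupby("MovieID")["Rating"].transform("max")
)

print(max_per_movie)
print(with_max)
```

Choosing between the two mostly depends on whether the result should be a summary table or an extra column on the original rows.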

# GroupBy objects

In [8]:
merged_df.groupby(["Occupation", "Genres"])

<pandas.core.groupby.DataFrameGroupBy object at 0x7fde459c29b0>

In [9]:
grouped_ratings = merged_df.groupby(["Occupation", "Genres"])
for key, group_df in grouped_ratings:
    print("group keys: " + str(key))
    print(group_df.iloc[:5])
    break

group keys: ('K-12 student', 'Action')
     UserID  MovieID  Rating  Timestamp           rating_dt Gender  Age  \
10        1     1197       3  978302268 2000-12-31 22:37:48      F    1   
14        1     1287       5  978302039 2000-12-31 22:33:59      F    1   
90        1     2692       4  978301570 2000-12-31 22:26:10      F    1   
93        1      260       4  978300760 2000-12-31 22:12:40      F    1   
104       1     2028       5  978301619 2000-12-31 22:26:59      F    1   

    Zip-code    Occupation                                      Title  Genres  
10     48067  K-12 student                 Princess Bride, The (1987)  Action  
14     48067  K-12 student                             Ben-Hur (1959)  Action  
90     48067  K-12 student           Run Lola Run (Lola rennt) (1998)  Action  
93     48067  K-12 student  Star Wars: Episode IV - A New Hope (1977)  Action  
104    48067  K-12 student                 Saving Private Ryan (1998)  Action  


# Let's do Split-Apply-Combine manually

In [10]:
results = {'Occupation': [], 'Genres': [], 'Rating_mean':[]}
grouped_ratings = merged_df.groupby(["Occupation", "Genres"])  # split
for key, group_df in grouped_ratings:
    results['Occupation'].append(key[0])
    results['Genres'].append(key[1])
    results['Rating_mean'].append(group_df.Rating.mean())  # apply
pd.DataFrame(results).loc[0:10]  # combine

Unnamed: 0,Genres,Occupation,Rating_mean
0,Action,K-12 student,3.497116
1,Adventure,K-12 student,3.425658
2,Animation,K-12 student,3.463956
3,Children's,K-12 student,3.220679
4,Comedy,K-12 student,3.4972
5,Crime,K-12 student,3.687085
6,Documentary,K-12 student,3.581633
7,Drama,K-12 student,3.782167
8,Fantasy,K-12 student,3.298039
9,Film-Noir,K-12 student,4.212766


In [30]:
import numpy as np
tmp = merged_df[merged_df.Occupation == 'K-12 student'].copy()
tmp.loc[:,'Rating_mean'] = tmp.groupby(["Occupation", "Genres"])['Rating'].transform(np.mean)
print(tmp[['Occupation', 'Genres', 'Rating_mean']].sort_values(by='Genres').iloc[1:10])

           Occupation  Genres  Rating_mean
804564   K-12 student  Action     3.497116
363454   K-12 student  Action     3.497116
1806560  K-12 student  Action     3.497116
772492   K-12 student  Action     3.497116
1806556  K-12 student  Action     3.497116
659674   K-12 student  Action     3.497116
706974   K-12 student  Action     3.497116
827117   K-12 student  Action     3.497116
363460   K-12 student  Action     3.497116


**GroupBy.transform() returns a pandas Series with the same index as the original DataFrame.**

**Therefore, it is easy to combine the result back into the original data.**
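
As a sketch of why the shared index matters, the toy example below (made-up ratings, non-default index labels, and a hypothetical `Rating_centered` column) subtracts each group's mean from its rows without any merge step:

```python
import pandas as pd

df = pd.DataFrame(
    {"Genres": ["Action", "Drama", "Action", "Drama"], "Rating": [4, 5, 2, 3]},
    index=[10, 20, 30, 40],
)

group_mean = df.groupby("Genres")["Rating"].transform("mean")

# The result carries the input's index, so it lines up row by row.
same_index = group_mean.index.equals(df.index)

# Element-wise arithmetic works directly: rating minus its group's mean.
df["Rating_centered"] = df["Rating"] - group_mean

print(same_index)
print(df)
```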

In [12]:
print(tmp[['Occupation', 'Genres', 'Rating_mean']].sort_values(by='Genres').drop_duplicates().iloc[0:10])

           Occupation       Genres  Rating_mean
418625   K-12 student       Action     3.497116
831363   K-12 student    Adventure     3.425658
136334   K-12 student    Animation     3.463956
453973   K-12 student   Children's     3.220679
789324   K-12 student       Comedy     3.497200
692103   K-12 student        Crime     3.687085
1931149  K-12 student  Documentary     3.581633
790387   K-12 student        Drama     3.782167
521989   K-12 student      Fantasy     3.298039
364059   K-12 student    Film-Noir     4.212766


In [13]:
merged_df.groupby(['Occupation', 'Genres'])['Rating'].mean()

Occupation         Genres     
K-12 student       Action         3.497116
                   Adventure      3.425658
                   Animation      3.463956
                   Children's     3.220679
                   Comedy         3.497200
                   Crime          3.687085
                   Documentary    3.581633
                   Drama          3.782167
                   Fantasy        3.298039
                   Film-Noir      4.212766
                   Horror         3.237795
                   Musical        3.556738
                   Mystery        3.636612
                   Romance        3.624415
                   Sci-Fi         3.443795
                   Thriller       3.554131
                   War            3.880144
                   Western        3.513333
academic/educator  Action         3.392063
                   Adventure      3.424278
                   Animation      3.693399
                   Children's     3.459286
                   Come

In [14]:
merged_df.groupby(['Occupation', 'Genres'])['Rating'].agg(np.mean).loc['K-12 student']

Genres
Action         3.497116
Adventure      3.425658
Animation      3.463956
Children's     3.220679
Comedy         3.497200
Crime          3.687085
Documentary    3.581633
Drama          3.782167
Fantasy        3.298039
Film-Noir      4.212766
Horror         3.237795
Musical        3.556738
Mystery        3.636612
Romance        3.624415
Sci-Fi         3.443795
Thriller       3.554131
War            3.880144
Western        3.513333
Name: Rating, dtype: float64

In [15]:
merged_df.groupby(['Occupation', 'Genres'])['Rating'].agg(np.mean).reset_index().loc[0:5]

Unnamed: 0,Occupation,Genres,Rating
0,K-12 student,Action,3.497116
1,K-12 student,Adventure,3.425658
2,K-12 student,Animation,3.463956
3,K-12 student,Children's,3.220679
4,K-12 student,Comedy,3.4972
5,K-12 student,Crime,3.687085


In [27]:
merged_df.groupby(['Occupation', 'Genres']).agg({'Rating': np.mean}).reset_index().loc[0:5]

Unnamed: 0,Occupation,Genres,Rating
0,K-12 student,Action,3.497116
1,K-12 student,Adventure,3.425658
2,K-12 student,Animation,3.463956
3,K-12 student,Children's,3.220679
4,K-12 student,Comedy,3.4972
5,K-12 student,Crime,3.687085


In [39]:
merged_df.groupby(['Occupation', 'Genres']).agg({'Rating': pd.Series.quantile}).reset_index()

AttributeError: module 'pandas' has no attribute 'quantile'
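
The error message refers to a module-level `quantile`, which pandas does not provide. Two ways that should work are the built-in `GroupBy.quantile` method and passing a callable to `agg`. The sketch below uses a toy frame with made-up occupations and ratings rather than `merged_df`:

```python
import pandas as pd

df = pd.DataFrame({
    "Occupation": ["student", "student", "student", "artist", "artist"],
    "Rating": [1, 2, 5, 2, 4],
})

grouped = df.groupby("Occupation")["Rating"]

# Built-in GroupBy method: the 0.5 quantile (median) of each group.
median_builtin = grouped.quantile(0.5)

# Any callable that reduces a Series to a scalar can be passed to agg.
median_lambda = grouped.agg(lambda s: s.quantile(0.5))

print(median_builtin)
print(median_lambda)
```

The lambda form is the more general one, since it can pass any quantile level or other arguments to the underlying Series method.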