# Practical Data Transformation and Analysis with Pandas
   ## Zong-han, Xie <icbm0926@gmail.com>

# Outline
1. Understanding basic components: 
  - Introducing Pandas Series and DataFrame
  - Indexing in Pandas and notes on "SettingWithCopyWarning"
2. Text Handling with Pandas
  - Using the "str" accessor to handle strings
  - Using regular expressions with Pandas
3. Merging and Concatenating tables
  - Concept of merging two tables (inner join, left/right join, outer join)
  - Concatenating tables
4. Split-Apply-Combine strategy
  - DataFrame.groupby
  - Data transformation with the transform and agg functions
5. Demo: using the Split-Apply-Combine strategy to aggregate data, followed by Q&A

In [1]:
# %load Extract_MovieLens_Data.py


# # This notebook extracts data from MovieLens
# * The data contents are explained in http://files.grouplens.org/papers/ml-1m-README.txt
# 
# ## users.dat
# 
# UserID::Gender::Age::Occupation::Zip-code
# - Gender is denoted by a "M" for male and "F" for female
# - Age is chosen from the following ranges:
# 
# 	*  1:  "Under 18"
# 	* 18:  "18-24"
# 	* 25:  "25-34"
# 	* 35:  "35-44"
# 	* 45:  "45-49"
# 	* 50:  "50-55"
# 	* 56:  "56+"
# 
# - Occupation is chosen from the following choices:
# 
# 	*  0:  "other" or not specified
# 	*  1:  "academic/educator"
# 	*  2:  "artist"
# 	*  3:  "clerical/admin"
# 	*  4:  "college/grad student"
# 	*  5:  "customer service"
# 	*  6:  "doctor/health care"
# 	*  7:  "executive/managerial"
# 	*  8:  "farmer"
# 	*  9:  "homemaker"
# 	* 10:  "K-12 student"
# 	* 11:  "lawyer"
# 	* 12:  "programmer"
# 	* 13:  "retired"
# 	* 14:  "sales/marketing"
# 	* 15:  "scientist"
# 	* 16:  "self-employed"
# 	* 17:  "technician/engineer"
# 	* 18:  "tradesman/craftsman"
# 	* 19:  "unemployed"
# 	* 20:  "writer"
# 
# ## movies.dat
# MovieID::Title::Genres
# 
# ## ratings.dat
# UserID::MovieID::Rating::Timestamp

# In[1]:

import pandas as pd


# In[3]:

users_df = pd.read_csv("./ml-1m/users.dat"
                    , sep='::'
                    , engine='python'  # multi-character separators need the Python parser
                    , header=None
                    , names=["UserID", "Gender", "Age", "Occupation", "Zip-code"])
# Lookup table mapping each occupation code to its description
occupation_codes = {'occupation_code': list(range(21))
                    , 'Occupation_name': ["other or not specified", "academic/educator", "artist"
                                   , "clerical/admin", "college/grad student", "customer service"
                                   , "doctor/health care", "executive/managerial", "farmer"
                                   , "homemaker", "K-12 student", "lawyer", "programmer", "retired"
                                   , "sales/marketing", "scientist", "self-employed", "technician/engineer"
                                   , "tradesman/craftsman", "unemployed", "writer"]
                   }
occupation_codes = pd.DataFrame(occupation_codes)
users_df = users_df.merge(occupation_codes, left_on=["Occupation"], right_on=["occupation_code"], how='left')
users_df = users_df.drop(["Occupation", "occupation_code"], axis=1).rename(columns={'Occupation_name': 'Occupation'})


# In[4]:

movies_org_df = pd.read_csv("./ml-1m/movies.dat"
                            , sep='::'
                            , engine='python'  # multi-character separators need the Python parser
                            , header=None
                            , names=["MovieID", "Title", "Genres"])
rows = []
for _, row in movies_org_df.iterrows():
    for gen in row.Genres.split('|'):
        rows.append([row['MovieID'], row['Title'], gen])
movies_df = pd.DataFrame(rows, columns=movies_org_df.columns)


# In[5]:

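ratings_df = pd.read_csv("./ml-1m/ratings.dat"
                         , sep='::'
                         , engine='python'  # multi-character separators need the Python parser
                         , header=None
                         , names=["UserID", "MovieID", "Rating", "Timestamp"])
# Timestamp holds seconds since the Unix epoch
ratings_df['rating_dt'] = pd.to_datetime(ratings_df['Timestamp'], unit='s')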





In [2]:
print(users_df.columns)
print(movies_df.columns)
print(ratings_df.columns)

Index(['UserID', 'Gender', 'Age', 'Zip-code', 'Occupation'], dtype='object')
Index(['MovieID', 'Title', 'Genres'], dtype='object')
Index(['UserID', 'MovieID', 'Rating', 'Timestamp', 'rating_dt'], dtype='object')


In [3]:
df = ratings_df.merge(users_df, on=['UserID'], how='left').merge(movies_df, on=['MovieID'], how='left')

# DataFrame Indexing

In [24]:
part_users_df = users_df.loc[[2,4,8]]
part_users_df[2, 'Gender'] = 'F'  # Does not set a cell: this adds a new column named (2, 'Gender')
print(part_users_df)

   UserID Gender  Age Zip-code           Occupation (2, Gender)
2       3      M   25    55117            scientist           F
4       5      M   25    55455               writer           F
8       9      M   25    61614  technician/engineer           F


In [25]:
part_users_df = users_df.loc[[2,4,8]]
part_users_df.loc[2, 'Gender'] = 'F'
print(part_users_df)

   UserID Gender  Age Zip-code           Occupation
2       3      F   25    55117            scientist
4       5      M   25    55455               writer
8       9      M   25    61614  technician/engineer
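
The contrast between the two cells above can be reproduced on a small hand-made frame. The following is a minimal sketch on toy data (the `toy_users` rows are illustrative, not read from `users.dat`), and the explicit `.copy()` is an added assumption that keeps the assignment independent of the parent frame:

```python
import pandas as pd

# Toy stand-in for users_df; the values are illustrative only.
toy_users = pd.DataFrame(
    {"UserID": [3, 5, 9], "Gender": ["M", "M", "M"]},
    index=[2, 4, 8],
)

# df[key] = value always targets a column, so a tuple key adds a new column.
with_tuple_column = toy_users.loc[[2, 4]].copy()
with_tuple_column[2, "Gender"] = "F"
added_columns = [
    c for c in list(with_tuple_column.columns) if c not in ("UserID", "Gender")
]

# .loc[row_label, column_label] targets a single cell.
part = toy_users.loc[[2, 4]].copy()
part.loc[2, "Gender"] = "F"
```

Taking an explicit copy before assigning is one common way to avoid `SettingWithCopyWarning`, since the subset is then clearly not a view of `toy_users`.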


In [26]:
part_users_df.set_index("UserID")

       Gender  Age Zip-code           Occupation
UserID
3           F   25    55117            scientist
5           M   25    55455               writer
9           M   25    61614  technician/engineer
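
Note that `set_index` returns a new DataFrame rather than modifying `part_users_df` in place. A small sketch on toy data (names such as `toy_users` and `by_user` are illustrative) suggests how the re-indexed frame can then be used for label lookups:

```python
import pandas as pd

# Toy stand-in for part_users_df; the values are illustrative only.
toy_users = pd.DataFrame({"UserID": [3, 5, 9], "Gender": ["F", "M", "M"]})

# set_index returns a new DataFrame; toy_users keeps its integer index.
by_user = toy_users.set_index("UserID")

# Label-based lookup now uses UserID values instead of row positions.
gender_of_user_5 = by_user.loc[5, "Gender"]

# reset_index moves the index back into an ordinary column.
restored = by_user.reset_index()
```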


# Text Handling

In [28]:
movies_org_df.loc[0:10]

   MovieID                               Title                        Genres
0        1                    Toy Story (1995)   Animation|Children's|Comedy
1        2                      Jumanji (1995)  Adventure|Children's|Fantasy
2        3             Grumpier Old Men (1995)                Comedy|Romance
3        4            Waiting to Exhale (1995)                  Comedy|Drama
4        5  Father of the Bride Part II (1995)                        Comedy
5        6                         Heat (1995)         Action|Crime|Thriller
6        7                      Sabrina (1995)                Comedy|Romance
7        8                 Tom and Huck (1995)          Adventure|Children's
8        9                 Sudden Death (1995)                        Action
9       10                    GoldenEye (1995)     Action|Adventure|Thriller


In [31]:
movies_org_df.loc[0:10].Genres.str.split("|")

0      [Animation, Children's, Comedy]
1     [Adventure, Children's, Fantasy]
2                    [Comedy, Romance]
3                      [Comedy, Drama]
4                             [Comedy]
5            [Action, Crime, Thriller]
6                    [Comedy, Romance]
7              [Adventure, Children's]
8                             [Action]
9        [Action, Adventure, Thriller]
10            [Comedy, Drama, Romance]
Name: Genres, dtype: object
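
The lists produced by `str.split` can be expanded into one row per genre, which is the same shape the loading script builds with its `iterrows` loop. A possible sketch, assuming a pandas version that provides `DataFrame.explode` (0.25 or later) and using an illustrative `toy_movies` frame:

```python
import pandas as pd

# Two hand-typed rows in the movies.dat layout; illustrative only.
toy_movies = pd.DataFrame(
    {
        "MovieID": [1, 5],
        "Title": ["Toy Story (1995)", "Father of the Bride Part II (1995)"],
        "Genres": ["Animation|Children's|Comedy", "Comedy"],
    }
)

# Split each pipe-separated string into a list, then emit one row per list element.
long_movies = (
    toy_movies.assign(Genres=toy_movies["Genres"].str.split("|"))
    .explode("Genres")
    .reset_index(drop=True)
)
```

Compared with a Python-level loop over rows, this keeps the work inside pandas, which tends to matter as the table grows.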

In [37]:
movies_org_df.loc[0:10].Genres.str.partition("|")

           0  1                   2
0  Animation  |   Children's|Comedy
1  Adventure  |  Children's|Fantasy
2     Comedy  |             Romance
3     Comedy  |               Drama
4     Comedy
5     Action  |      Crime|Thriller
6     Comedy  |             Romance
7  Adventure  |          Children's
8     Action
9     Action  |  Adventure|Thriller


In [38]:
movies_org_df.loc[0:10].Genres.str.rpartition("|")

                      0  1           2
0  Animation|Children's  |      Comedy
1  Adventure|Children's  |     Fantasy
2                Comedy  |     Romance
3                Comedy  |       Drama
4                               Comedy
5          Action|Crime  |    Thriller
6                Comedy  |     Romance
7             Adventure  |  Children's
8                               Action
9      Action|Adventure  |    Thriller
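
As the two outputs above suggest, `partition` cuts at the first separator and `rpartition` at the last. A short sketch on an illustrative two-row Series shows how that could be used to pick the first or last genre:

```python
import pandas as pd

# Hand-typed genre strings; illustrative only.
genres = pd.Series(["Action|Crime|Thriller", "Comedy"])

# partition splits at the first separator; rpartition splits at the last one.
first_split = genres.str.partition("|")
last_split = genres.str.rpartition("|")

# Column 0 holds the text before the separator, column 2 the text after it.
primary_genre = first_split[0]
final_genre = last_split[2]
```

When the separator is absent, `partition` leaves the whole string in column 0 and `rpartition` leaves it in column 2, so both selections still return "Comedy" for the second row.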


In [36]:
movies_df[movies_df.Title.str.contains('^Old.*')]

      MovieID                                              Title      Genres
1309      797  Old Lady Who Walked in the Sea, The (Vieille q...      Comedy
1657     1012                                  Old Yeller (1957)  Children's
1658     1012                                  Old Yeller (1957)       Drama
1791     1085                    Old Man and the Sea, The (1958)   Adventure
1792     1085                    Old Man and the Sea, The (1958)       Drama
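
`str.contains` treats its pattern as a regular expression by default, and `str.extract` can pull out a captured group. The sketch below uses a few hand-typed titles (illustrative data) to separate the two uses; the year pattern is an assumption about the `Title (YYYY)` layout seen in the tables above:

```python
import pandas as pd

# Hand-typed titles in the movies.dat style; illustrative only.
titles = pd.Series(
    ["Old Yeller (1957)", "Grumpier Old Men (1995)", "Old Man and the Sea, The (1958)"]
)

# '^Old' anchors the match at the start, so "Grumpier Old Men" is excluded.
starts_with_old = titles.str.contains(r"^Old")

# A capture group pulls the four-digit year out of the trailing parentheses.
years = titles.str.extract(r"\((\d{4})\)$", expand=False).astype(int)
```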


# Merge Data and Concat Data
* pandas.merge and pandas.concat
* pandas.merge is analogous to a join in SQL
* pandas.concat stacks tables along an axis, much like UNION ALL in SQL when stacking rows
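
The join types listed in the outline map onto the `how` argument of `merge`. The following sketch uses two tiny made-up tables (`ratings` and `users` here are illustrative, not the MovieLens frames) to show how the row counts could differ:

```python
import pandas as pd

# Made-up tables: UserID 4 has no user record, UserID 3 has no rating.
ratings = pd.DataFrame({"UserID": [1, 2, 4], "Rating": [5, 3, 4]})
users = pd.DataFrame({"UserID": [1, 2, 3], "Gender": ["F", "M", "M"]})

# how= selects the join type; the key column is UserID in both tables.
inner = ratings.merge(users, on="UserID", how="inner")  # keys present in both tables
left = ratings.merge(users, on="UserID", how="left")    # every row of ratings
outer = ratings.merge(users, on="UserID", how="outer")  # keys from either table

# concat stacks frames that share the same columns on top of each other.
stacked = pd.concat([ratings, ratings], ignore_index=True)
```

Rows without a match on the other side are kept by the left and outer joins, with the missing columns filled with NaN.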

# Split - Apply - Combine Strategy

* Split
    - Split data into groups
* Apply
    - Apply functions to data within each group independently
    - Functions include transformation, aggregation, and filtration

* Combine
    - Merge results into a data structure
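
The three steps can be spelled out by hand on a toy frame and compared with the one-line `groupby` form. This is only an illustrative sketch; the data and variable names are made up:

```python
import pandas as pd

# Made-up ratings by genre.
toy = pd.DataFrame(
    {"Genres": ["Comedy", "Drama", "Comedy", "Drama"], "Rating": [4, 2, 5, 3]}
)

# Split: iterating over a GroupBy yields (group key, sub-frame) pairs.
# Apply: compute the mean rating of each sub-frame independently.
pieces = {genre: group["Rating"].mean() for genre, group in toy.groupby("Genres")}

# Combine: collect the per-group results into one Series.
manual = pd.Series(pieces, name="Rating")

# groupby performs the same three steps in a single expression.
built_in = toy.groupby("Genres")["Rating"].mean()
```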


# Data Aggregation

```
select 
max(Ratings.Rating)
from Ratings
group by Ratings.MovieId
```

```
# Per-row counterpart of the SQL above: each movie's maximum rating, broadcast back onto every row of df
df.assign(max_ratings=df.groupby("MovieID")["Rating"].transform("max"))
```

* Data aggregation in Pandas uses GroupBy.apply, GroupBy.transform and GroupBy.aggregate.
* These functions are poorly documented in the Pandas documentation.
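
A rough side-by-side on toy data may help distinguish the three methods; the frame and variable names below are illustrative:

```python
import pandas as pd

# Made-up ratings: two for movie 1, one for movie 2.
toy = pd.DataFrame({"MovieID": [1, 1, 2], "Rating": [3, 5, 4]})
grouped = toy.groupby("MovieID")["Rating"]

# aggregate: one value per group, indexed by the group key (like SQL GROUP BY).
per_movie_max = grouped.agg("max")

# transform: one value per input row, aligned with the original index.
row_level_max = grouped.transform("max")

# apply: runs an arbitrary function on each group and combines whatever it returns.
rating_range = grouped.apply(lambda s: s.max() - s.min())
```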

# GroupBy.transform

In [None]:
df.groupby(["Occupation", "Genres"])['Rating'].transform(lambda x: x.mean()).iloc[:10]

GroupBy.transform accepts a function argument that receives each group as a Pandas Series or DataFrame. The function's result, either a scalar or an object the same size as the group, is broadcast back so that the output is a Series or DataFrame with the same index as the input data.
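
Because the result keeps the input's index, it can be combined directly with the original column. A minimal sketch with made-up data, centering each rating on its occupation's mean:

```python
import pandas as pd

# Made-up ratings by occupation.
toy = pd.DataFrame(
    {"Occupation": ["writer", "writer", "artist"], "Rating": [2, 4, 5]}
)

# The lambda receives one group's ratings as a Series; its scalar result
# is repeated for every row of that group.
group_mean = toy.groupby("Occupation")["Rating"].transform(lambda x: x.mean())

# The indexes match, so the subtraction lines up row by row.
centered = toy["Rating"] - group_mean
```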