## CMPINF 2110 Spring 2022 - Week 04

### Revisit the shoe example

## Import Modules

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

## Read data

Read in the tidied long-format shoe data set.

In [2]:
lf = pd.read_csv('../data/shoes_long_format.csv')

In [3]:
lf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   day       40 non-null     int64 
 1   shoe      40 non-null     object
 2   value     40 non-null     int64 
 3   location  40 non-null     object
dtypes: int64(2), object(2)
memory usage: 1.4+ KB


In [4]:
lf.head()

Unnamed: 0,day,shoe,value,location
0,1,W,12,N
1,2,W,5,N
2,3,W,9,N
3,4,W,4,N
4,1,B,5,N


We had defined the **observational unit** as the number or count of a color of shoe entering a location on a day.
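One way to sanity check that definition is to confirm that each combination of day, shoe, and location appears at most once. The sketch below uses a small made-up DataFrame named `toy` (an assumption for illustration, not the real shoe data) and the pandas `duplicated()` method.

```python
import pandas as pd

# Toy stand-in for the long-format shoe data (values are made up).
toy = pd.DataFrame({
    'day':      [1, 2, 1, 2],
    'shoe':     ['W', 'W', 'B', 'B'],
    'location': ['N', 'N', 'N', 'N'],
    'value':    [12, 5, 5, 8],
})

# The columns that together identify one observational unit.
key_cols = ['day', 'shoe', 'location']

# True if any (day, shoe, location) combination shows up more than once.
has_duplicates = toy.duplicated(subset=key_cols).any()
print(has_duplicates)
```

If this prints `True` on real data, then one row is not one observational unit and the data would need another look before joining anything to it.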

As a refresher we have the following unique values.

In [5]:
lf.nunique()

day          4
shoe         4
value       16
location     3
dtype: int64

In [6]:
lf.location.value_counts()

N    16
D    16
P     8
Name: location, dtype: int64

In [7]:
lf.shoe.value_counts()

W    10
B    10
R    10
O    10
Name: shoe, dtype: int64

In [8]:
lf.day.value_counts()

2    12
3    12
1     8
4     8
Name: day, dtype: int64

## Add information about the locations

I don't remember what D, N, and P stand for in this example. Let's use the real names for the locations.

It's easier to work with a smaller data set focused on the locations when we make this change.

Essentially, let's create a DataFrame where 1 row is 1 location.

In [11]:
lf.groupby(['location']).nunique().reset_index()

Unnamed: 0,location,day,shoe,value
0,D,4,4,9
1,N,4,4,14
2,P,2,4,7


In [14]:
lf.groupby(['location']).size().reset_index(name='num_rows')

Unnamed: 0,location,num_rows
0,D,16
1,N,16
2,P,8


But we don't care about the number of rows at the moment, so we can drop it.

In [16]:
lf.groupby(['location']).size().reset_index(name='num_rows').drop(columns=['num_rows'])

Unnamed: 0,location
0,D
1,N
2,P


In [17]:
location_info = lf.groupby(['location']).size().reset_index(name='num_rows').drop(columns=['num_rows'])

In [18]:
location_info

Unnamed: 0,location
0,D
1,N
2,P
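The same one-row-per-location table can likely be built without the group-by and drop steps. The following is a sketch of an alternative that uses `drop_duplicates()`, shown on a made-up DataFrame named `toy` (hypothetical data, not the shoe data set).

```python
import pandas as pd

# Toy long-format data with repeated location codes (values are made up).
toy = pd.DataFrame({'location': ['N', 'N', 'D', 'P', 'D'],
                    'value':    [12, 5, 9, 15, 2]})

# Keep only the location column, drop repeated codes, sort the codes,
# and rebuild a clean 0, 1, 2, ... index.
location_lookup = (
    toy[['location']]
    .drop_duplicates()
    .sort_values('location')
    .reset_index(drop=True)
)
print(location_lookup)
```

The sort step mimics the alphabetical ordering that `groupby()` produces by default, so the result lines up with the group-by approach used here.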


One row is one location and so now we can easily add information or **attributes** about each location.

In [19]:
location_info['location_name'] = pd.Series(['Dunkin Donuts', 
                                            'Noodles & Company',
                                            'Panera Bread'],
                                            index=location_info.index)

In [20]:
location_info

Unnamed: 0,location,location_name
0,D,Dunkin Donuts
1,N,Noodles & Company
2,P,Panera Bread
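Assigning a Series by position works here because we know the row order. A sketch of a less order-dependent alternative is to map each code to its name with a dictionary. The names `location_lookup` and `name_by_code` below are assumptions made for this illustration.

```python
import pandas as pd

# Stand-in for the one-row-per-location table.
location_lookup = pd.DataFrame({'location': ['D', 'N', 'P']})

# Explicit code-to-name mapping, so the row order no longer matters.
name_by_code = {
    'D': 'Dunkin Donuts',
    'N': 'Noodles & Company',
    'P': 'Panera Bread',
}

# map() looks up each code in the dictionary; a code that is missing
# from the dictionary would become NaN instead of raising an error.
location_lookup['location_name'] = location_lookup['location'].map(name_by_code)
print(location_lookup)
```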


What if we wanted to know the address for each location?

In [21]:
location_info['address'] = pd.Series(['3907 Forbes Ave, Pittsburgh, PA 15123',
                                      '3805 Forbes Ave, Pittsburgh, PA 15123',
                                      '3800 Forbes Ave, Pittsburgh, PA 15123'],
                                     index=location_info.index)

In [22]:
location_info

Unnamed: 0,location,location_name,address
0,D,Dunkin Donuts,"3907 Forbes Ave, Pittsburgh, PA 15123"
1,N,Noodles & Company,"3805 Forbes Ave, Pittsburgh, PA 15123"
2,P,Panera Bread,"3800 Forbes Ave, Pittsburgh, PA 15123"


Split or SEPARATE a string to extract pieces of information from a longer string.

Let's split the address into the street, city, state, and zipcode.

Split or separate by the comma pattern first. Include an additional space after the comma.

In [24]:
location_info.address.str.split(', ')

0    [3907 Forbes Ave, Pittsburgh, PA 15123]
1    [3805 Forbes Ave, Pittsburgh, PA 15123]
2    [3800 Forbes Ave, Pittsburgh, PA 15123]
Name: address, dtype: object

In [25]:
location_info.address.str.split(', ', expand=True)

Unnamed: 0,0,1,2
0,3907 Forbes Ave,Pittsburgh,PA 15123
1,3805 Forbes Ave,Pittsburgh,PA 15123
2,3800 Forbes Ave,Pittsburgh,PA 15123


In [26]:
location_info[['street_address', 'city', 'state_zip']] = location_info.address.str.split(', ', expand=True)

In [27]:
location_info

Unnamed: 0,location,location_name,address,street_address,city,state_zip
0,D,Dunkin Donuts,"3907 Forbes Ave, Pittsburgh, PA 15123",3907 Forbes Ave,Pittsburgh,PA 15123
1,N,Noodles & Company,"3805 Forbes Ave, Pittsburgh, PA 15123",3805 Forbes Ave,Pittsburgh,PA 15123
2,P,Panera Bread,"3800 Forbes Ave, Pittsburgh, PA 15123",3800 Forbes Ave,Pittsburgh,PA 15123


Separate the state and zipcode on the white space.

In [28]:
location_info.state_zip.str.split(' ')

0    [PA, 15123]
1    [PA, 15123]
2    [PA, 15123]
Name: state_zip, dtype: object

In [29]:
location_info.state_zip.str.split(' ', expand=True)

Unnamed: 0,0,1
0,PA,15123
1,PA,15123
2,PA,15123


In [30]:
location_info[['state', 'zipcode']] = location_info.state_zip.str.split(' ', expand=True)

In [31]:
location_info

Unnamed: 0,location,location_name,address,street_address,city,state_zip,state,zipcode
0,D,Dunkin Donuts,"3907 Forbes Ave, Pittsburgh, PA 15123",3907 Forbes Ave,Pittsburgh,PA 15123,PA,15123
1,N,Noodles & Company,"3805 Forbes Ave, Pittsburgh, PA 15123",3805 Forbes Ave,Pittsburgh,PA 15123,PA,15123
2,P,Panera Bread,"3800 Forbes Ave, Pittsburgh, PA 15123",3800 Forbes Ave,Pittsburgh,PA 15123,PA,15123


Drop the columns we no longer need now that their contents have been parsed into separate columns.

In [32]:
location_info.drop(columns=['address', 'state_zip'], inplace=True)

In [33]:
location_info

Unnamed: 0,location,location_name,street_address,city,state,zipcode
0,D,Dunkin Donuts,3907 Forbes Ave,Pittsburgh,PA,15123
1,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
2,P,Panera Bread,3800 Forbes Ave,Pittsburgh,PA,15123
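The two-stage split followed by a drop could also be done in a single pass with a regular expression. This is a sketch using `str.extract()` with named groups; the pattern assumes every address follows the exact `street, city, ST 12345` layout seen in this example and would return missing values for anything else.

```python
import pandas as pd

# Two of the addresses from the example above.
addresses = pd.Series(['3907 Forbes Ave, Pittsburgh, PA 15123',
                       '3805 Forbes Ave, Pittsburgh, PA 15123'])

# Each named group becomes one column in the result.
pattern = (r'^(?P<street_address>[^,]+), '
           r'(?P<city>[^,]+), '
           r'(?P<state>[A-Z]{2}) '
           r'(?P<zipcode>\d{5})$')

parts = addresses.str.extract(pattern)
print(parts)
```

Note that the zipcode comes back as a string, which is usually what we want because leading zeros would be lost in an integer.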


We can JOIN the more detailed information to our TIDY data!

In [34]:
pd.merge(lf, location_info, on='location', how='left')

Unnamed: 0,day,shoe,value,location,location_name,street_address,city,state,zipcode
0,1,W,12,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
1,2,W,5,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
2,3,W,9,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
3,4,W,4,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
4,1,B,5,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
5,2,B,8,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
6,3,B,22,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
7,4,B,2,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
8,1,R,3,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
9,2,R,6,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123


I used a left join...would it have been wrong here to use a RIGHT JOIN?

In [35]:
lf.head()

Unnamed: 0,day,shoe,value,location
0,1,W,12,N
1,2,W,5,N
2,3,W,9,N
3,4,W,4,N
4,1,B,5,N


In [36]:
lf.tail()

Unnamed: 0,day,shoe,value,location
35,3,B,4,P
36,2,R,4,P
37,3,R,5,P
38,2,O,15,P
39,3,O,11,P


In [37]:
location_info

Unnamed: 0,location,location_name,street_address,city,state,zipcode
0,D,Dunkin Donuts,3907 Forbes Ave,Pittsburgh,PA,15123
1,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
2,P,Panera Bread,3800 Forbes Ave,Pittsburgh,PA,15123


What if we used a RIGHT JOIN instead of a LEFT JOIN?

In [38]:
pd.merge(lf, location_info, on='location', how='right')

Unnamed: 0,day,shoe,value,location,location_name,street_address,city,state,zipcode
0,1,W,9,D,Dunkin Donuts,3907 Forbes Ave,Pittsburgh,PA,15123
1,2,W,9,D,Dunkin Donuts,3907 Forbes Ave,Pittsburgh,PA,15123
2,3,W,2,D,Dunkin Donuts,3907 Forbes Ave,Pittsburgh,PA,15123
3,4,W,5,D,Dunkin Donuts,3907 Forbes Ave,Pittsburgh,PA,15123
4,1,B,8,D,Dunkin Donuts,3907 Forbes Ave,Pittsburgh,PA,15123
5,2,B,3,D,Dunkin Donuts,3907 Forbes Ave,Pittsburgh,PA,15123
6,3,B,11,D,Dunkin Donuts,3907 Forbes Ave,Pittsburgh,PA,15123
7,4,B,8,D,Dunkin Donuts,3907 Forbes Ave,Pittsburgh,PA,15123
8,1,R,2,D,Dunkin Donuts,3907 Forbes Ave,Pittsburgh,PA,15123
9,2,R,8,D,Dunkin Donuts,3907 Forbes Ave,Pittsburgh,PA,15123


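For this example, it does not matter whether we use a LEFT, RIGHT, INNER, or OUTER join. Every `location` value in `lf` has exactly one match in `location_info`, and every location in `location_info` appears in `lf`, so all four joins return the same 40 rows. Only the row order can differ, as the RIGHT JOIN output above shows.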

In [39]:
pd.merge(lf, location_info, on='location', how='inner')

Unnamed: 0,day,shoe,value,location,location_name,street_address,city,state,zipcode
0,1,W,12,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
1,2,W,5,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
2,3,W,9,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
3,4,W,4,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
4,1,B,5,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
5,2,B,8,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
6,3,B,22,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
7,4,B,2,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
8,1,R,3,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
9,2,R,6,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123


In [40]:
pd.merge(lf, location_info, on='location', how='outer')

Unnamed: 0,day,shoe,value,location,location_name,street_address,city,state,zipcode
0,1,W,12,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
1,2,W,5,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
2,3,W,9,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
3,4,W,4,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
4,1,B,5,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
5,2,B,8,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
6,3,B,22,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
7,4,B,2,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
8,1,R,3,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
9,2,R,6,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
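The join type starts to matter as soon as the keys in the two tables do not match up perfectly. The sketch below uses made-up tables where the counts contain a hypothetical code `X` that is missing from the lookup table, and the lookup table contains `P`, which never appears in the counts. The `indicator` and `validate` arguments are standard `pd.merge()` options; the data is invented for illustration.

```python
import pandas as pd

# 'X' is a hypothetical location code with no entry in the lookup table.
counts = pd.DataFrame({'location': ['N', 'D', 'X'],
                       'value':    [12, 9, 7]})

# 'P' is in the lookup table but has no rows in counts.
lookup = pd.DataFrame({'location': ['D', 'N', 'P'],
                       'location_name': ['Dunkin Donuts',
                                         'Noodles & Company',
                                         'Panera Bread']})

n_rows = {}
for how in ['left', 'right', 'inner', 'outer']:
    merged = pd.merge(counts, lookup, on='location', how=how,
                      indicator=True,            # adds a _merge column showing each row's source
                      validate='many_to_one')    # raises if lookup has repeated keys
    n_rows[how] = len(merged)

print(n_rows)
```

With mismatched keys, the LEFT join keeps `X` with a missing name, the RIGHT join keeps `P` with a missing value, the INNER join keeps only `N` and `D`, and the OUTER join keeps everything.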


## Identify the unique shoes and add context

In [41]:
lf.head()

Unnamed: 0,day,shoe,value,location
0,1,W,12,N
1,2,W,5,N
2,3,W,9,N
3,4,W,4,N
4,1,B,5,N


In [42]:
lf.shoe.value_counts()

W    10
B    10
R    10
O    10
Name: shoe, dtype: int64

In [43]:
shoe_info = lf.groupby(['shoe']).size().reset_index(name='num_rows').drop(columns=['num_rows'])

In [44]:
shoe_info

Unnamed: 0,shoe
0,B
1,O
2,R
3,W


In [45]:
shoe_info['shoe_color'] = pd.Series(['Black', 'Other', 'Red', 'White'],
                                    index=shoe_info.index)

In [46]:
shoe_info

Unnamed: 0,shoe,shoe_color
0,B,Black
1,O,Other
2,R,Red
3,W,White


We can merge all 3 data sets together in a single line of code by CHAINING the JOINS.

Pay close attention to the **on** variable with the JOINS.

In [48]:
pd.merge(lf, shoe_info, on='shoe', how='left').\
merge(location_info, on='location', how='left')

Unnamed: 0,day,shoe,value,location,shoe_color,location_name,street_address,city,state,zipcode
0,1,W,12,N,White,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
1,2,W,5,N,White,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
2,3,W,9,N,White,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
3,4,W,4,N,White,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
4,1,B,5,N,Black,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
5,2,B,8,N,Black,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
6,3,B,22,N,Black,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
7,4,B,2,N,Black,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
8,1,R,3,N,Red,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
9,2,R,6,N,Red,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
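Chained joins can also be written inside parentheses instead of using the backslash line continuation, which some people find easier to read and edit. This is a sketch on small made-up tables (`lf_toy`, `shoe_toy`, and `loc_toy` are hypothetical names), using the `DataFrame.merge()` method.

```python
import pandas as pd

# Small stand-ins for the three tables (values are made up).
lf_toy = pd.DataFrame({'shoe': ['W', 'B'],
                       'location': ['N', 'P'],
                       'value': [12, 4]})
shoe_toy = pd.DataFrame({'shoe': ['B', 'W'],
                         'shoe_color': ['Black', 'White']})
loc_toy = pd.DataFrame({'location': ['N', 'P'],
                        'location_name': ['Noodles & Company', 'Panera Bread']})

# Each merge uses the key that links the running result to the next table.
combined = (
    lf_toy
    .merge(shoe_toy, on='shoe', how='left')
    .merge(loc_toy, on='location', how='left')
)
print(combined)
```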


## Integer identifiers

In [49]:
location_info

Unnamed: 0,location,location_name,street_address,city,state,zipcode
0,D,Dunkin Donuts,3907 Forbes Ave,Pittsburgh,PA,15123
1,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
2,P,Panera Bread,3800 Forbes Ave,Pittsburgh,PA,15123


In [50]:
shoe_info

Unnamed: 0,shoe,shoe_color
0,B,Black
1,O,Other
2,R,Red
3,W,White


We will see in a little bit why it is useful to identify each row in the smaller, more refined tables by an integer.

In [51]:
location_info['location_id'] = location_info.index + 1

In [52]:
location_info

Unnamed: 0,location,location_name,street_address,city,state,zipcode,location_id
0,D,Dunkin Donuts,3907 Forbes Ave,Pittsburgh,PA,15123,1
1,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123,2
2,P,Panera Bread,3800 Forbes Ave,Pittsburgh,PA,15123,3


In [53]:
shoe_info['shoe_id'] = shoe_info.index + 1

In [54]:
shoe_info

Unnamed: 0,shoe,shoe_color,shoe_id
0,B,Black,1
1,O,Other,2
2,R,Red,3
3,W,White,4
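Integer codes can also be generated directly from the original column, without building the lookup table first. The sketch below uses `pd.factorize()` with `sort=True` on a made-up Series; adding 1 is an assumption made so the codes start at 1 like the identifiers above.

```python
import pandas as pd

# Made-up shoe codes with one repeated value.
shoes = pd.Series(['W', 'B', 'R', 'O', 'W'])

# factorize() returns zero-based integer codes and the unique values.
# sort=True orders the unique values alphabetically first.
codes, uniques = pd.factorize(shoes, sort=True)

# Shift to one-based identifiers.
shoe_ids = codes + 1

print(list(uniques))
print(shoe_ids.tolist())
```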


Merge the unique integer identifiers into the TIDY data set.

In [55]:
lf_copy = pd.merge(lf, shoe_info.loc[:, ['shoe_id', 'shoe']], on='shoe', how='left').\
merge(location_info.loc[:, ['location_id', 'location']], on='location', how='left')

In [56]:
lf_copy.head()

Unnamed: 0,day,shoe,value,location,shoe_id,location_id
0,1,W,12,N,4,2
1,2,W,5,N,4,2
2,3,W,9,N,4,2
3,4,W,4,N,4,2
4,1,B,5,N,1,2


In [57]:
lf_copy.tail()

Unnamed: 0,day,shoe,value,location,shoe_id,location_id
35,3,B,4,P,1,3
36,2,R,4,P,3,3
37,3,R,5,P,3,3
38,2,O,15,P,2,3
39,3,O,11,P,2,3


We can drop the original `shoe` and `location` columns.

In [58]:
lf_copy.drop(columns=['shoe', 'location'], inplace=True)

In [59]:
lf_copy.head()

Unnamed: 0,day,value,shoe_id,location_id
0,1,12,4,2
1,2,5,4,2
2,3,9,4,2
3,4,4,4,2
4,1,5,1,2


In [60]:
lf_copy = lf_copy[['day', 'location_id', 'shoe_id', 'value']].copy()

In [61]:
lf_copy.head()

Unnamed: 0,day,location_id,shoe_id,value
0,1,2,4,12
1,2,2,4,5
2,3,2,4,9
3,4,2,4,4
4,1,2,1,5


We can always recover the TIDY, analytics-ready data by JOINING the 3 data sets together!

In [62]:
pd.merge(lf_copy, shoe_info, on='shoe_id', how='left').\
merge(location_info, on='location_id', how='left')

Unnamed: 0,day,location_id,shoe_id,value,shoe,shoe_color,location,location_name,street_address,city,state,zipcode
0,1,2,4,12,W,White,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
1,2,2,4,5,W,White,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
2,3,2,4,9,W,White,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
3,4,2,4,4,W,White,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
4,1,2,1,5,B,Black,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
5,2,2,1,8,B,Black,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
6,3,2,1,22,B,Black,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
7,4,2,1,2,B,Black,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
8,1,2,3,3,R,Red,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
9,2,2,3,6,R,Red,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
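It can be reassuring to confirm that nothing was lost when the text columns were replaced by integer identifiers. The following sketch does a round trip on a small made-up example (`tidy`, `shoe_lookup`, `link`, and `recovered` are hypothetical names) and compares the result with `pd.testing.assert_frame_equal()`.

```python
import pandas as pd

# Made-up tidy data and its lookup table.
tidy = pd.DataFrame({'day': [1, 2], 'shoe': ['W', 'B'], 'value': [12, 8]})
shoe_lookup = pd.DataFrame({'shoe_id': [1, 2], 'shoe': ['B', 'W']})

# Replace the text code with the integer identifier.
link = tidy.merge(shoe_lookup, on='shoe', how='left').drop(columns=['shoe'])

# Join the lookup table back in and restore the original column order.
recovered = link.merge(shoe_lookup, on='shoe_id', how='left')[tidy.columns]

# Raises an AssertionError if the two DataFrames differ.
pd.testing.assert_frame_equal(recovered, tidy)
print('round trip OK')
```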


In [63]:
lf_copy.head()

Unnamed: 0,day,location_id,shoe_id,value
0,1,2,4,12
1,2,2,4,5
2,3,2,4,9
3,4,2,4,4
4,1,2,1,5


Add in a unique integer row identifier to the **link** table.

In [64]:
shoe_per_day = lf_copy.copy()

In [65]:
shoe_per_day['id'] = shoe_per_day.index + 1

In [66]:
shoe_per_day.head()

Unnamed: 0,day,location_id,shoe_id,value,id
0,1,2,4,12,1
1,2,2,4,5,2
2,3,2,4,9,3
3,4,2,4,4,4
4,1,2,1,5,5


In [67]:
shoe_per_day.tail()

Unnamed: 0,day,location_id,shoe_id,value,id
35,3,3,1,4,36
36,2,3,3,4,37
37,3,3,3,5,38
38,2,3,2,15,39
39,3,3,2,11,40


Rearrange the columns.

In [68]:
['id'] + lf_copy.columns.to_list()

['id', 'day', 'location_id', 'shoe_id', 'value']

In [69]:
shoe_per_day = shoe_per_day[['id'] + lf_copy.columns.to_list()].copy()

In [70]:
shoe_per_day.head()

Unnamed: 0,id,day,location_id,shoe_id,value
0,1,1,2,4,12
1,2,2,2,4,5
2,3,3,2,4,9
3,4,4,2,4,4
4,5,1,2,1,5


Rearrange the columns for the shoe and location DataFrames as well.

In [71]:
shoe_info

Unnamed: 0,shoe,shoe_color,shoe_id
0,B,Black,1
1,O,Other,2
2,R,Red,3
3,W,White,4


In [72]:
['shoe_id'] + shoe_info.columns.to_list()[:-1]

['shoe_id', 'shoe', 'shoe_color']

In [73]:
shoe_info = shoe_info[['shoe_id'] + shoe_info.columns.to_list()[:-1]].copy()

In [74]:
shoe_info

Unnamed: 0,shoe_id,shoe,shoe_color
0,1,B,Black
1,2,O,Other
2,3,R,Red
3,4,W,White


In [76]:
location_info

Unnamed: 0,location,location_name,street_address,city,state,zipcode,location_id
0,D,Dunkin Donuts,3907 Forbes Ave,Pittsburgh,PA,15123,1
1,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123,2
2,P,Panera Bread,3800 Forbes Ave,Pittsburgh,PA,15123,3


In [75]:
['location_id'] + location_info.columns.to_list()[:-1]

['location_id',
 'location',
 'location_name',
 'street_address',
 'city',
 'state',
 'zipcode']

In [77]:
location_info = location_info[['location_id'] + location_info.columns.to_list()[:-1]].copy()

In [78]:
location_info

Unnamed: 0,location_id,location,location_name,street_address,city,state,zipcode
0,1,D,Dunkin Donuts,3907 Forbes Ave,Pittsburgh,PA,15123
1,2,N,Noodles & Company,3805 Forbes Ave,Pittsburgh,PA,15123
2,3,P,Panera Bread,3800 Forbes Ave,Pittsburgh,PA,15123
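Moving one column to the front can also be done in place, without rebuilding the list of column names. This is a sketch using the `pop()` and `insert()` DataFrame methods on a made-up table named `table` (a hypothetical stand-in for the lookup tables).

```python
import pandas as pd

# Made-up lookup table with the identifier in the last column.
table = pd.DataFrame({'shoe': ['B', 'O'],
                      'shoe_color': ['Black', 'Other'],
                      'shoe_id': [1, 2]})

# Work on a copy so the original table is left untouched.
reordered = table.copy()

# pop() removes the column and returns it; insert() puts it at position 0.
reordered.insert(0, 'shoe_id', reordered.pop('shoe_id'))

print(reordered.columns.to_list())
```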


Save the 3 DataFrames to CSV files.

In [79]:
shoe_per_day.to_csv('../data/shoe_per_day_table.csv', header=True, index=False)

In [80]:
shoe_info.to_csv('../data/shoe_info_table.csv', header=True, index=False)

In [81]:
location_info.to_csv('../data/location_info_table.csv', header=True, index=False)
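After saving, it may be worth reading a file back in and checking that every identifier in the link table has a match in its lookup table. The sketch below uses made-up tables and a temporary directory rather than the `../data` folder; the names `link`, `locations`, and `link.csv` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up link table and location lookup table.
link = pd.DataFrame({'id': [1, 2, 3],
                     'day': [1, 2, 3],
                     'location_id': [2, 2, 3],
                     'shoe_id': [4, 1, 2],
                     'value': [12, 8, 11]})
locations = pd.DataFrame({'location_id': [1, 2, 3],
                          'location': ['D', 'N', 'P']})

# Write to a temporary folder and read the file straight back in.
with tempfile.TemporaryDirectory() as tmp_dir:
    link_path = os.path.join(tmp_dir, 'link.csv')
    link.to_csv(link_path, index=False)
    link_back = pd.read_csv(link_path)

# The saved file should reproduce the original table.
round_trip_ok = link_back.equals(link)

# Every location_id in the link table should exist in the lookup table.
ids_ok = link_back['location_id'].isin(locations['location_id']).all()

print(round_trip_ok, ids_ok)
```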