# Pandas - Prof. Xin Xue
### More binning; Melting; Pivoting
##### 2/14/22



In [206]:
import pandas as pd
import numpy as np

In [207]:
np.random.seed(1)

Create a 2D array: 10,000 draws from a bivariate normal with means 1 and 0.5 and identity covariance

In [208]:
arr = np.random.multivariate_normal(mean = [1, 0.5],
                                    cov = [[1,0],[0,1]], 
                                    size = 10000)
arr.shape

(10000, 2)

Create a Pandas DF:
- fill in with the 2D array from earlier
- give it col names

In [209]:
df = pd.DataFrame(arr, columns=['x1','x2'])
df.head()

Unnamed: 0,x1,x2
0,2.624345,-0.111756
1,0.471828,-0.572969
2,1.865408,-1.801539
3,2.744812,-0.261207
4,1.319039,0.25063


# Binning 
Binning = bucketizing values into a small number of intervals (bins)
- Use `pd.cut(df['col'], bins=...)`: `bins` gives the boundaries, `labels` names the resulting intervals

ex) Here we'll be setting the bin boundaries to:
- negative infinity
- (-1)
- 0
- 1
- positive infinity

Reminder: we saw this binning last week

In [210]:
pd.cut(df['x1'], 
       bins = [-np.inf, -1, 0, 1, np.inf], 
       labels = [1,2,3,4])

0       4
1       3
2       4
3       4
4       4
       ..
9995    4
9996    1
9997    4
9998    4
9999    4
Name: x1, Length: 10000, dtype: category
Categories (4, int64): [1 < 2 < 3 < 4]
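To see how the boundaries map to labels, here is a small sketch on hand-picked values. The series `small` and its values are made up for illustration and are not part of the lecture data. By default `pd.cut` uses right-closed intervals, so a value exactly on a boundary falls into the lower bin.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values chosen to land in each bin (not from the lecture data)
small = pd.Series([-2.5, -1.0, -0.3, 0.0, 0.7, 1.0, 3.2])

binned = pd.cut(small,
                bins=[-np.inf, -1, 0, 1, np.inf],
                labels=[1, 2, 3, 4])

# Intervals are right-closed by default: (-inf, -1], (-1, 0], (0, 1], (1, inf]
print(pd.DataFrame({'value': small, 'bin': binned}))

# Count how many values landed in each bin
print(binned.value_counts(sort=False))
```

Passing `right=False` to `pd.cut` would make the intervals left-closed instead.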

Another common type of binning uses `pd.qcut` (quantile-based cut)
- Here `q = [0, 0.25, 0.50, 0.75, 1]` cuts at the quartiles, so each bin holds about 25% of the rows

In [211]:
df['quartile_x1'] = pd.qcut(df['x1'], 
                            q = [0, 0.25, 0.50, 0.75, 1], 
                            labels = ['1st_quartile',
                                      '2nd_quartile',
                                      '3rd_quartile',
                                      '4th_quartile'])
df.head()

Unnamed: 0,x1,x2,quartile_x1
0,2.624345,-0.111756,4th_quartile
1,0.471828,-0.572969,2nd_quartile
2,1.865408,-1.801539,4th_quartile
3,2.744812,-0.261207,4th_quartile
4,1.319039,0.25063,3rd_quartile
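Unlike `pd.cut`, the bin edges of `pd.qcut` are not fixed in advance; they depend on the data. A minimal sketch, assuming a separate random sample (`values`, with its own hypothetical seed) rather than the lecture's `df`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)            # hypothetical seed, separate from the lecture's
values = pd.Series(rng.normal(size=1000))

# q=4 is shorthand for q=[0, 0.25, 0.5, 0.75, 1]
quartile = pd.qcut(values, q=4, labels=['q1', 'q2', 'q3', 'q4'])

# Each quantile bin holds (about) the same number of rows
counts = quartile.value_counts(sort=False)
print(counts)

# retbins=True also returns the data-dependent bin edges that qcut chose
_, edges = pd.qcut(values, q=4, retbins=True)
print(edges)
```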


# Group by related
Group by quartile_x1
- Get the mean and standard deviation (std) of the x2 column within each quartile
- `agg` = aggregate: computes one value per group for each function listed (here mean and std)
- can also use `apply` instead

In [212]:
df.groupby(['quartile_x1'])['x2'].agg(['mean','std']).reset_index()

Unnamed: 0,quartile_x1,mean,std
0,1st_quartile,0.530831,1.005354
1,2nd_quartile,0.474668,0.98434
2,3rd_quartile,0.518893,1.006167
3,4th_quartile,0.481867,0.981813


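Passing a dict to `agg` applies different functions to different columns: here the mean of x1, plus the std and variance of x2. The result has two-level column headers.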

In [213]:
df.groupby(['quartile_x1']).agg({'x1':'mean','x2':['std','var']})

Unnamed: 0_level_0,x1,x2,x2
Unnamed: 0_level_1,mean,std,var
quartile_x1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1st_quartile,-0.263848,1.005354,1.010736
2nd_quartile,0.693346,0.98434,0.968924
3rd_quartile,1.351169,1.006167,1.012371
4th_quartile,2.289905,0.981813,0.963957
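If the two-level column headers are inconvenient, named aggregation is one way to get flat column names. The sketch below uses a made-up table `toy`; the output column names (`x1_mean`, `x2_std`, `x2_var`) are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the lecture's df
toy = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                    'x1': [1.0, 3.0, 10.0, 30.0],
                    'x2': [2.0, 4.0, 5.0, 9.0]})

# Named aggregation: output_column=(input_column, function)
summary = (toy.groupby('group')
              .agg(x1_mean=('x1', 'mean'),
                   x2_std=('x2', 'std'),
                   x2_var=('x2', 'var'))
              .reset_index())
print(summary)
```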


With `transform` you can apply any User Defined Function (UDF) you want, or name a built-in
- Here we use `mean`
- `transform` returns one value per original row, so the result can be stored as a col of the original DF

This pulls the group mean value into the DF as a column:
- NOTE: the same value repeats for every row in a quartile

In [214]:
df['mean_x2_by_qx1'] = df.groupby(['quartile_x1'])['x2'].transform('mean')
df

Unnamed: 0,x1,x2,quartile_x1,mean_x2_by_qx1
0,2.624345,-0.111756,4th_quartile,0.481867
1,0.471828,-0.572969,2nd_quartile,0.474668
2,1.865408,-1.801539,4th_quartile,0.481867
3,2.744812,-0.261207,4th_quartile,0.481867
4,1.319039,0.250630,3rd_quartile,0.518893
...,...,...,...,...
9995,1.383964,-0.318778,3rd_quartile,0.518893
9996,-1.124622,-0.921937,1st_quartile,0.530831
9997,2.109570,-0.443208,4th_quartile,0.481867
9998,1.782216,2.908434,4th_quartile,0.481867
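The difference between `agg` and `transform` is easiest to see side by side. A small sketch on a made-up table (`toy` is hypothetical, not the lecture's `df`):

```python
import pandas as pd

toy = pd.DataFrame({'group': ['a', 'a', 'b', 'b', 'b'],
                    'x': [1.0, 3.0, 2.0, 4.0, 6.0]})

per_group = toy.groupby('group')['x'].agg('mean')       # one row per group
per_row = toy.groupby('group')['x'].transform('mean')   # one row per original row

print(per_group)
print(per_row)

# transform keeps the original index, so it combines directly with the
# original column, e.g. to center each value within its group
toy['x_centered'] = toy['x'] - per_row
print(toy)
```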


QUESTION) 
- Want to compute x2 / sum(x2) within each bucket

ANSWER #1)

In [215]:
df['sum_x2_by_qx1'] = df.groupby(['quartile_x1'])['x2'].transform('sum')

df['ratio_x2_to_sumx2'] = df['x2'] / df['sum_x2_by_qx1']

df

Unnamed: 0,x1,x2,quartile_x1,mean_x2_by_qx1,sum_x2_by_qx1,ratio_x2_to_sumx2
0,2.624345,-0.111756,4th_quartile,0.481867,1204.667016,-0.000093
1,0.471828,-0.572969,2nd_quartile,0.474668,1186.669886,-0.000483
2,1.865408,-1.801539,4th_quartile,0.481867,1204.667016,-0.001495
3,2.744812,-0.261207,4th_quartile,0.481867,1204.667016,-0.000217
4,1.319039,0.250630,3rd_quartile,0.518893,1297.233311,0.000193
...,...,...,...,...,...,...
9995,1.383964,-0.318778,3rd_quartile,0.518893,1297.233311,-0.000246
9996,-1.124622,-0.921937,1st_quartile,0.530831,1327.078695,-0.000695
9997,2.109570,-0.443208,4th_quartile,0.481867,1204.667016,-0.000368
9998,1.782216,2.908434,4th_quartile,0.481867,1204.667016,0.002414


ANSWER #2) 

- Another way to compute x2 / sum(x2)  within each bucket

In [216]:
udf_ratio = lambda x: x/x.sum()

# Lambda is the same as defining this func:
# def udf_ratio(x): 
#   return x/x.sum()

In [217]:
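# NOTE: with pandas 2.x, .apply here returns a result indexed by (group, row),
#       which may fail to assign back as a column; .transform keeps the original row index
df['ratio_x2_to_sumx2'] = df.groupby(['quartile_x1'])['x2'].transform(udf_ratio)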
df

Unnamed: 0,x1,x2,quartile_x1,mean_x2_by_qx1,sum_x2_by_qx1,ratio_x2_to_sumx2
0,2.624345,-0.111756,4th_quartile,0.481867,1204.667016,-0.000093
1,0.471828,-0.572969,2nd_quartile,0.474668,1186.669886,-0.000483
2,1.865408,-1.801539,4th_quartile,0.481867,1204.667016,-0.001495
3,2.744812,-0.261207,4th_quartile,0.481867,1204.667016,-0.000217
4,1.319039,0.250630,3rd_quartile,0.518893,1297.233311,0.000193
...,...,...,...,...,...,...
9995,1.383964,-0.318778,3rd_quartile,0.518893,1297.233311,-0.000246
9996,-1.124622,-0.921937,1st_quartile,0.530831,1327.078695,-0.000695
9997,2.109570,-0.443208,4th_quartile,0.481867,1204.667016,-0.000368
9998,1.782216,2.908434,4th_quartile,0.481867,1204.667016,0.002414


Let's create a new DF for family spending

In [218]:
consumption = {'family': [1,1,2,2],
               'gender': [0,1,0,1],
               'spend': [50, 100, 75, 80]}
df_spend = pd.DataFrame(consumption)
df_spend

Unnamed: 0,family,gender,spend
0,1,0,50
1,1,1,100
2,2,0,75
3,2,1,80


Q) How much does each person's spending account for w/in each family?

- `udf_ratio = lambda x: x/x.sum()`

The `udf_ratio` UDF (from above) receives the spend values of one GROUP at a time and divides each value by that group's sum

- Essentially, this gives each person's share of the group's total spend

Calculate each person's portion of the family spending

In [219]:
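# .transform keeps the result aligned to the original rows
# (with pandas 2.x, .apply here returns a (family, row) index that may fail to assign back)
df_spend['ratio'] = df_spend.groupby('family')['spend'].transform(udf_ratio)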
df_spend

Unnamed: 0,family,gender,spend,ratio
0,1,0,50,0.333333
1,1,1,100,0.666667
2,2,0,75,0.483871
3,2,1,80,0.516129
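A quick sanity check on these ratios: within each family they should add up to 1. The sketch below rebuilds the same small table under the hypothetical names `spend` and `ratio_totals`, and uses `transform`, which always returns a result aligned to the original rows.

```python
import pandas as pd

spend = pd.DataFrame({'family': [1, 1, 2, 2],
                      'gender': [0, 1, 0, 1],
                      'spend': [50, 100, 75, 80]})

# Each person's share of their own family's total
spend['ratio'] = spend.groupby('family')['spend'].transform(lambda s: s / s.sum())

# Sanity check: ratios within each family should sum to 1
ratio_totals = spend.groupby('family')['ratio'].sum()
print(ratio_totals)
```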


# Melt and Pivot

`melt` and `pivot` perform opposite / complementary functions:

* `melt` transforms from wide to long
* `pivot` transforms from long to wide



In [220]:
# Let's pare the table back down to just 2 columns
df = df[['x1','x2']]

### Widen table

...by adding a couple columns

In [221]:
# The following gives a 'SettingWithCopyWarning':
#    "A value is trying to be set on a copy of a slice...."
# df was created by selecting columns from another DataFrame, so pandas cannot
#   tell whether the assignment is also meant to reach the original.
# Here it is harmless; creating the subset with .copy() avoids the warning.
# The warning can also be disabled via
#   pd.options.mode.chained_assignment = None  # default='warn'
#   (execute just after the pandas import)

df['x3'] = df.loc[:, 'x1'] + df.loc[:, 'x2']
df['x4'] = df.loc[:,'x1'] * df.loc[:,'x2']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['x3'] = df.loc[:, 'x1'] + df.loc[:, 'x2']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['x4'] = df.loc[:,'x1'] * df.loc[:,'x2']


In [222]:
df

Unnamed: 0,x1,x2,x3,x4
0,2.624345,-0.111756,2.512589,-0.293287
1,0.471828,-0.572969,-0.101140,-0.270343
2,1.865408,-1.801539,0.063869,-3.360604
3,2.744812,-0.261207,2.483605,-0.716964
4,1.319039,0.250630,1.569669,0.330590
...,...,...,...,...
9995,1.383964,-0.318778,1.065186,-0.441178
9996,-1.124622,-0.921937,-2.046559,1.036830
9997,2.109570,-0.443208,1.666362,-0.934979
9998,1.782216,2.908434,4.690650,5.183457
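Two common ways to add derived columns to a subset without triggering the `SettingWithCopyWarning` are sketched below. The data and the names `full`, `sub` and `sub2` are hypothetical stand-ins for the lecture's tables.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)   # hypothetical data standing in for the lecture's df
full = pd.DataFrame(rng.normal(size=(5, 3)), columns=['x1', 'x2', 'extra'])

# Option 1: .copy() makes the subset an independent DataFrame,
# so later column assignments are unambiguous
sub = full[['x1', 'x2']].copy()
sub['x3'] = sub['x1'] + sub['x2']
sub['x4'] = sub['x1'] * sub['x2']

# Option 2: assign() returns a new DataFrame and never modifies its input
sub2 = full[['x1', 'x2']].assign(x3=lambda d: d['x1'] + d['x2'],
                                 x4=lambda d: d['x1'] * d['x2'])

print(sub)
```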


In [223]:
# Here we go - melt from wide to long table
# This version uses only cols x1 and x2:
#df_melted = pd.melt(df, value_vars=['x1','x2'], value_name='x')
# ...this version uses all (non-id) columns, i.e. x1 through x4
df_melted = pd.melt(df, value_name='x')

In [224]:
df_melted

Unnamed: 0,variable,x
0,x1,2.624345
1,x1,0.471828
2,x1,1.865408
3,x1,2.744812
4,x1,1.319039
...,...,...
39995,x4,-0.441178
39996,x4,1.036830
39997,x4,-0.934979
39998,x4,5.183457


In [225]:
# let's see what we have
df_melted.groupby(['variable']).size()

variable
x1    10000
x2    10000
x3    10000
x4    10000
dtype: int64

### Pivot back to original layout

We melted wide to long.....now revert back to wide.

In [226]:
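# Change back to original layout

# These calls do not restore the original table:
#   the melt above kept no id column, so nothing records which original row
#   each value came from, and pivot has no key to line the values up on
#pd.pivot_table(df_melted,columns='variable')
#pd.pivot(df_melted,values='x',columns='variable')
# Likely fix: keep a row identifier (id_vars) when melting, then pivot on it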
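One way to make the round trip work is to carry the original row label through the melt as an id column, then pivot on it. This is a sketch on a small made-up table (`wide`, `long` and `restored` are hypothetical names), not the course's official solution.

```python
import pandas as pd

wide = pd.DataFrame({'x1': [1.0, 2.0, 3.0],
                     'x2': [10.0, 20.0, 30.0]})

# Keep the row label as an id column so each melted row remembers where it came from
long = wide.reset_index().melt(id_vars=['index'], value_name='x')
print(long)

# Pivot back: one row per original index value, one column per variable
restored = long.pivot(index='index', columns='variable', values='x')

# Tidy up the axis names that pivot leaves behind
restored.index.name = None
restored.columns.name = None
print(restored)
```

The same pattern should apply to the 10,000-row table: `df.reset_index().melt(id_vars=['index'], value_name='x')` followed by the `pivot` call above.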

## Mimicking melting

Splitting df into smaller dfs that need to be combined.

I.e., these are "wide" because the data sits in multiple parallel dfs.

Often you don't care where the data came from; you're just merging streams.

### Example
Pipeline workflow:
```
output -> shard1.csv, shard2.csv... df1, df2, df3, ... 
pd.concat([df1, df2, df3...])
```



In [227]:
# Create a couple parallel dfs
df_x1 = df[['x1']]
df_x2 = df[['x2']]

# (optional) can add labels to show source of each data point
df_x1['label'] = 'x1'
df_x2['label'] = 'x2'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_x1['label'] = 'x1'
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_x2['label'] = 'x2'


In [228]:
df_x1.head()


Unnamed: 0,x1,label
0,2.624345,x1
1,0.471828,x1
2,1.865408,x1
3,2.744812,x1
4,1.319039,x1


In [229]:
# start by renaming columns so concatenation does not require column mapping
df_x1.rename(columns={'x1':'x'}, inplace=True)
df_x2.rename(columns={'x2':'x'}, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(


In [230]:
df_x2.head()

Unnamed: 0,x,label
0,-0.111756,x2
1,-0.572969,x2
2,-1.801539,x2
3,-0.261207,x2
4,0.25063,x2


In [231]:
df_stacked = pd.concat([df_x1, df_x2])

In [232]:
# note: like the melted table (40K rows from 4 columns), this is long format;
#   2 dfs x 10K rows gives 20K rows, and each df keeps its own index labels
df_stacked

Unnamed: 0,x,label
0,2.624345,x1
1,0.471828,x1
2,1.865408,x1
3,2.744812,x1
4,1.319039,x1
...,...,...
9995,-0.318778,x2
9996,-0.921937,x2
9997,-0.443208,x2
9998,2.908434,x2
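Because `pd.concat` keeps each piece's own index, the stacked table has repeated index labels. The sketch below, on two tiny made-up frames (`part_a`, `part_b`), shows two options: renumbering with `ignore_index=True`, or recording the source with `keys=` instead of adding a label column by hand.

```python
import pandas as pd

part_a = pd.DataFrame({'x': [1.0, 2.0]})
part_b = pd.DataFrame({'x': [3.0, 4.0]})

# Plain concat keeps each piece's own index, so labels 0 and 1 appear twice
stacked = pd.concat([part_a, part_b])
print(stacked.index.tolist())

# ignore_index=True renumbers the rows 0..n-1
renumbered = pd.concat([part_a, part_b], ignore_index=True)
print(renumbered.index.tolist())

# keys= records the source of each row in an outer index level
labeled = pd.concat([part_a, part_b], keys=['a', 'b'], names=['source', 'row'])
print(labeled.reset_index())
```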


### Combining via Joins

Takes datasets that share an index or key and combines them side by side into a single wider df

In [233]:
# Join horizontally, not vertically
#    both dfs have cols named 'x' and 'label', so give suffixes for left/right
#  This joins on the index (could provide a key column to match on)
df_x1.join(df_x2, lsuffix='_l', rsuffix='_r')

Unnamed: 0,x_l,label_l,x_r,label_r
0,2.624345,x1,-0.111756,x2
1,0.471828,x1,-0.572969,x2
2,1.865408,x1,-1.801539,x2
3,2.744812,x1,-0.261207,x2
4,1.319039,x1,0.250630,x2
...,...,...,...,...
9995,1.383964,x1,-0.318778,x2
9996,-1.124622,x1,-0.921937,x2
9997,2.109570,x1,-0.443208,x2
9998,1.782216,x1,2.908434,x2


In [234]:
# merge is a more general version of join: it can match on columns as well as the index
#  here it again uses the index of both dfs as the join key
#  (overlapping col names get the default suffixes _x and _y)
df_x1.merge(df_x2, left_index=True, right_index=True)

Unnamed: 0,x_x,label_x,x_y,label_y
0,2.624345,x1,-0.111756,x2
1,0.471828,x1,-0.572969,x2
2,1.865408,x1,-1.801539,x2
3,2.744812,x1,-0.261207,x2
4,1.319039,x1,0.250630,x2
...,...,...,...,...
9995,1.383964,x1,-0.318778,x2
9996,-1.124622,x1,-0.921937,x2
9997,2.109570,x1,-0.443208,x2
9998,1.782216,x1,2.908434,x2


### ...or sometimes you want to merge on a particular column - can do SQL-style
```
df_x1.merge(df_x2, on=['colName'], how='inner')   # how is one of 'inner', 'left', 'right', 'outer'
```

In [235]:
# example building on the df_spend case
df_spend

Unnamed: 0,family,gender,spend,ratio
0,1,0,50,0.333333
1,1,1,100,0.666667
2,2,0,75,0.483871
3,2,1,80,0.516129


In [236]:
address={'address':['MD','VA','DC'], 'family':[1,2,3]}

In [237]:
df_address = pd.DataFrame(address)
df_address

Unnamed: 0,address,family
0,MD,1
1,VA,2
2,DC,3


In [238]:
# inner: keep only families present in both tables (family 3 has no spend rows, so DC drops out)
df_spend.merge(df_address, on='family', how='inner')

Unnamed: 0,family,gender,spend,ratio,address
0,1,0,50,0.333333,MD
1,1,1,100,0.666667,MD
2,2,0,75,0.483871,VA
3,2,1,80,0.516129,VA


In [239]:
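# outer: keep all rows from both tables; unmatched cells become NaN
#   (which is why gender and spend are now shown as floats)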
df_spend.merge(df_address, on='family', how='outer')

Unnamed: 0,family,gender,spend,ratio,address
0,1,0.0,50.0,0.333333,MD
1,1,1.0,100.0,0.666667,MD
2,2,0.0,75.0,0.483871,VA
3,2,1.0,80.0,0.516129,VA
4,3,,,,DC
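When checking an outer merge, it can help to see which side each row came from. A sketch using `merge`'s `indicator=True` option on a rebuilt copy of the two small tables (the names `spend`, `address` and `merged` are for illustration):

```python
import pandas as pd

spend = pd.DataFrame({'family': [1, 1, 2, 2],
                      'spend': [50, 100, 75, 80]})
address = pd.DataFrame({'address': ['MD', 'VA', 'DC'],
                        'family': [1, 2, 3]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = spend.merge(address, on='family', how='outer', indicator=True)
print(merged)

# Families that have an address but no spending records
unmatched = merged.loc[merged['_merge'] == 'right_only', 'family'].tolist()
print(unmatched)
```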


### Splitting data

Use case: train/test split

- Need a training set and a test set
- Same data cols, diff rows
- Want the rows selected at random

In [240]:
# can shuffle the data: start with the row positions 0..len(df)-1
index = np.arange(0, len(df))

In [241]:
np.random.seed(1)
np.random.shuffle(index)


In [242]:
# didn't want to modify the original set so making a copy
df_shuffled = df.iloc[index,:].copy()

In [243]:
df_shuffled

Unnamed: 0,x1,x2,x3,x4
9953,1.065580,1.609592,2.675172,1.715149
3850,1.411033,0.039863,1.450896,0.056248
4962,2.660633,2.133164,4.793797,5.675568
3886,0.426264,-0.276243,0.150021,-0.117752
5437,2.211217,0.893136,3.104353,1.974918
...,...,...,...,...
2895,1.031140,0.447048,1.478188,0.460969
7813,0.480924,-1.093394,-0.612471,-0.525839
905,2.009090,0.611973,2.621063,1.229508
5192,-0.068265,1.806465,1.738201,-0.123318


In [244]:
# Of course you could always use a function vs. a lambda  (is there any perf diff?)
#def splitting(df, train_percent):
# ... you finish here
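One possible way to finish the exercise is sketched below. This is not the course's reference solution: the function name `split_train_test`, the `seed` parameter and the `demo` table are all assumptions made for illustration. It shuffles row positions, then slices them into two disjoint pieces.

```python
import numpy as np
import pandas as pd

def split_train_test(data, train_fraction, seed=1):
    """Return (train, test) with rows randomly assigned to each piece."""
    rng = np.random.default_rng(seed)
    shuffled_positions = rng.permutation(len(data))   # random order of row positions
    n_train = int(len(data) * train_fraction)
    train = data.iloc[shuffled_positions[:n_train]].copy()
    test = data.iloc[shuffled_positions[n_train:]].copy()
    return train, test

# Hypothetical demo table, not the lecture's df
demo = pd.DataFrame({'x1': np.arange(10), 'x2': np.arange(10) * 2})
train, test = split_train_test(demo, train_fraction=0.8)
print(len(train), len(test))
```

A shorter alternative with the same effect is `train = data.sample(frac=0.8, random_state=1)` followed by `test = data.drop(train.index)`, which relies on the index labels being unique.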