# MANIPULATING TABULAR DATA

## Lesson Objectives

* Review what makes a dataset tidy.
* Meet a complete set of functions for most table manipulations.
* Learn to transform datasets with split-apply-combine procedures.
* Understand the basic join operation.

## Specific Achievements

* Reshape data frames with pandas
* Summarize data by groups with pandas
* Combine multiple data frame operations with pipes
* Combine multiple data frames with “joins” (merge)

Data frames occupy a central place in Python data analysis pipelines. The pandas package provides the objects and most of the tools needed to subset, reformat, and transform data frames. Its key functions have close counterparts in SQL (Structured Query Language), which has the added bonus of easing translation between Python and relational databases.

https://cyberhelp.sesync.org/census-data-manipulation-in-R-lesson/



## Worksheet

## Wide to long

The pandas package’s melt function reshapes “wide” data frames into “long” ones.

In [None]:
import pandas as pd
import numpy as np
trial_df = pd.DataFrame({"block": [1,2,3],
              "drug": [0.22,0.12,0.42],
              "control": [0.58,0.98,0.19],
              "placebo": [0.31,0.47,0.40]})
trial_df.head()

In [None]:
tidy_trial_df = pd.melt(trial_df,
                  id_vars=['block'],
                  var_name='treatment',
                  value_name='response')
tidy_trial_df.head()

## Long to wide

Use the "pivot" function to go from long format to wide.

In [None]:
df2 = tidy_trial_df.pivot(index='block',
                          columns='treatment',
                          values='response')
df2.columns

In [None]:
df2.reset_index()

In [None]:
df2
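The two reshaping steps above can be checked as a round trip. The sketch below uses a small made-up table (the `wide` frame and its values are assumptions for illustration, not lesson data) and shows that melting and then pivoting recovers the original values, once the column order and the leftover column-axis name are tidied up.

```python
import pandas as pd

# Hypothetical toy data, not part of the lesson files.
wide = pd.DataFrame({"block": [1, 2],
                     "drug": [0.2, 0.1],
                     "control": [0.5, 0.9]})

# Wide to long: one row per (block, treatment) pair.
long = pd.melt(wide, id_vars=["block"],
               var_name="treatment", value_name="response")

# Long to wide again. pivot sorts the new column labels and names the
# column axis "treatment", so restore the original layout afterwards.
round_trip = (
    long.pivot(index="block", columns="treatment", values="response")
    .reset_index()
    .rename_axis(columns=None)[["block", "drug", "control"]]
)

print(long.shape)
print(round_trip)
```

The long table has one row for each of the four block and treatment combinations, which is the tidy layout the lesson objectives refer to.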

Consider survey data on participants’ age and income stored in an EAV (entity-attribute-value) structure.

In [None]:
from io import StringIO, BytesIO

text_string = StringIO("""
participant,attr,val
1,age,24
2,age,57
3,age,13
1,income,30
2,income,60
""")

survey_df = pd.read_csv(text_string, sep=",")
survey_df

Transform the data with the pivot function, which “reverses” a melt. The pandas pivot and melt functions correspond to spread and gather in the tidyr R package.

In [None]:
tidy_survey = survey_df.pivot(index='participant',
                          columns='attr',
                          values='val')
print(tidy_survey.head())
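One caveat worth sketching: pivot needs each index and column pair to appear at most once. The example below uses made-up EAV records (the `eav` frame is an assumption, not lesson data) in which one participant reports income twice; `pivot` raises a `ValueError`, while `pivot_table` aggregates the duplicates and leaves absent pairs as NaN.

```python
import pandas as pd

# Hypothetical EAV records; participant 1 reports income twice,
# participant 2 reports no income at all.
eav = pd.DataFrame({
    "participant": [1, 1, 1, 2],
    "attr": ["age", "income", "income", "age"],
    "val": [24, 30, 34, 57],
})

pivot_failed = False
try:
    eav.pivot(index="participant", columns="attr", values="val")
except ValueError as err:
    # Duplicate (participant, attr) pairs cannot be reshaped unambiguously.
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table resolves duplicates with an aggregation function.
wide = eav.pivot_table(index="participant", columns="attr",
                       values="val", aggfunc="mean")
print(wide)
```

Whether the mean is the right way to resolve duplicates depends on the data; it is used here only to keep the sketch short.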

In [None]:
tidy_survey = tidy_survey.reset_index()
tidy_survey.columns

In [None]:
tidy_survey.reset_index()

Note that "reset_index" moves the current index into a column and generates a new index running from 0 to the number of rows minus 1. Calling it a second time therefore adds a column named "index".

In [None]:
tidy_survey
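A minimal sketch of this behavior on a made-up frame (the `df` object is an assumption for illustration): the first call moves the named index into a column, a second call adds the old 0-based index as a column named `index`, and `drop=True` discards the old index instead.

```python
import pandas as pd

# Hypothetical frame indexed by participant.
df = pd.DataFrame({"age": [24, 57]},
                  index=pd.Index([1, 2], name="participant"))

once = df.reset_index()                # 'participant' becomes a column
twice = once.reset_index()             # old 0..n-1 index becomes column 'index'
dropped = once.reset_index(drop=True)  # old index is discarded, no new column

print(list(once.columns))     # ['participant', 'age']
print(list(twice.columns))    # ['index', 'participant', 'age']
print(list(dropped.columns))  # ['participant', 'age']
```

Because reset_index returns a new data frame rather than modifying in place, the result has to be assigned to keep it.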

## Sample Data

In [None]:
import pandas as pd
cbp = pd.read_csv('data/cbp15co.csv')
cbp.describe()

In [None]:
print(cbp.dtypes)


In [None]:
import numpy as np
import pandas as pd

cbp = pd.read_csv(
  'data/cbp15co.csv',
  na_values = "NULL",
  keep_default_na=False,
  # read FIPS codes as strings so leading zeros are preserved
  dtype =  {"FIPSTATE": str,
  "FIPSCTY": str}
  )

In [None]:
import pandas as pd
import numpy as np
acs =  pd.read_csv(
  'data/ACS/sector_ACS_15_5YR_S2413.csv',
  dtype = {"FIPS": str}  # keep FIPS as a string to preserve leading zeros
  )

In [None]:
print(acs.dtypes)

## Typical Data Manipulation Functions

## Filter Pattern matching

In [None]:
cbp2 = cbp[cbp['NAICS'].str.contains("----")]
cbp2 = cbp2[~cbp2.NAICS.str.contains("-----")]
cbp2.head()

In [None]:
cbp3 = cbp[cbp['NAICS'].str.contains('[0-9]{2}----')].copy()  # copy so new columns can be added safely
cbp3.head()
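The two filtering cells above aim at the same rows: codes made of two digits followed by four dashes. A small sketch on made-up values (the `naics` series only imitates the shape of the NAICS column and is an assumption) compares the two-step substring filter with a single anchored pattern using `str.fullmatch`.

```python
import pandas as pd

# Hypothetical codes shaped like the NAICS column.
naics = pd.Series(["------", "11----", "113///", "1133//", "44----"])

# Two steps: keep codes containing four dashes, then drop those with five.
two_step = naics[naics.str.contains("----") & ~naics.str.contains("-----")]

# One step: the whole string must be two digits followed by four dashes.
one_step = naics[naics.str.fullmatch(r"[0-9]{2}----")]

print(two_step.tolist())
print(one_step.tolist())
```

Note that `str.contains` searches anywhere in the string, whereas `str.match` anchors at the start and `str.fullmatch` requires the entire string to match.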

## Altering, updating and transforming columns


In [None]:
cbp3["FIPS"] = cbp3["FIPSTATE"]+cbp3["FIPSCTY"]

In [None]:
cbp3.assign(FIPS2=lambda x: x['FIPSTATE']+x['FIPSCTY'])

In [None]:
cbp3.shape

In [None]:
cbp3.head()
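The difference between the two approaches above can be sketched on a made-up frame (the `counties` data is an assumption for illustration): bracket assignment changes the data frame in place, while `assign` returns a new data frame and leaves the original untouched. The sketch also shows why the FIPS parts are read as strings, since concatenation keeps the leading zeros.

```python
import pandas as pd

# Hypothetical state and county codes stored as strings.
counties = pd.DataFrame({"FIPSTATE": ["01", "24"],
                         "FIPSCTY": ["001", "003"]})

# assign returns a new frame; `counties` itself is unchanged.
with_fips = counties.assign(FIPS=lambda d: d["FIPSTATE"] + d["FIPSCTY"])

print(list(counties.columns))      # ['FIPSTATE', 'FIPSCTY']
print(with_fips["FIPS"].tolist())  # ['01001', '24003']
```

Since `assign` does not modify its input, it fits naturally into a chain of method calls.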

## Select

In [None]:
cbp2.columns

In [None]:
cbp3 = cbp3[['FIPS','NAICS','N1_4', 'N5_9', 'N10_19']] 
cbp3.head()

In [None]:
cbp4= cbp.filter(regex='^N|FIPS|NAICS',axis=1) 
cbp4.head()
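As a sketch of how the regular expression in `filter` selects columns, the example below uses a made-up frame whose column names imitate the CBP file (the `toy` frame is an assumption). The pattern is searched within each column label, so `^N` already picks up `NAICS` and the size-class columns.

```python
import pandas as pd

# Hypothetical one-row frame with CBP-like column names.
toy = pd.DataFrame([["01", "001", "11----", 10, 3, 1]],
                   columns=["FIPSTATE", "FIPSCTY", "NAICS", "EMP", "N1_4", "N5_9"])

# Pattern used in the lesson: labels starting with N, or containing FIPS or NAICS.
selected = toy.filter(regex="^N|FIPS|NAICS", axis=1)

# A shorter pattern selects the same columns, because NAICS starts with N.
shorter = toy.filter(regex="^N|FIPS", axis=1)

print(list(selected.columns))
print(list(shorter.columns))
```

Columns are returned in their original order, not in the order of the alternatives in the pattern.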

## Join

In [None]:
sector =  pd.read_csv(
  'data/ACS/sector_naics.csv',
  dtype = {"NAICS": np.int64})
print(sector.dtypes)

In [None]:
print(cbp.dtypes)

In [None]:
cbp.head()

In [None]:
print(sector.dtypes)
print(sector.shape) #24 economic sectors
sector.head()

## Many-to-One

In [None]:
logical_idx = cbp['NAICS'].str.match('[0-9]{2}----') #boolean index
cbp = cbp.loc[logical_idx].copy()  # copy so later column updates do not act on a view
cbp.head()

In [None]:
cbp.shape

In [None]:
cbp['NAICS'] = cbp['NAICS'].str[:2].astype(np.int64) # select first two digits

In [None]:
#Many to one to join economic sector code to NAICS
cbp_test = cbp.merge(sector, on = "NAICS", how='inner')
cbp_test.head()

In [None]:
print(cbp_test.shape)
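A many-to-one join can be sketched on made-up tables (the `establishments` and `sectors` frames, and the sector labels, are assumptions for illustration). Each row on the left finds at most one match on the right, so the row count of the left table is preserved when every key is present. The `validate` argument of `merge` asks pandas to raise an error if the right-hand keys are not unique.

```python
import pandas as pd

# Hypothetical "many" side: several rows share a NAICS sector code.
establishments = pd.DataFrame({
    "FIPS": ["01001", "01001", "24003"],
    "NAICS": [11, 21, 11],
    "EST": [5, 2, 7],
})

# Hypothetical "one" side: each NAICS code appears exactly once.
sectors = pd.DataFrame({
    "NAICS": [11, 21],
    "Sector": ["agriculture", "mining"],
})

joined = establishments.merge(sectors, on="NAICS", how="inner",
                              validate="many_to_one")
print(joined)
print(joined.shape)
```

Comparing shapes before and after a merge, as the lesson does, is a quick check that no rows were dropped or duplicated unexpectedly.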

## Group By

In [None]:
cbp["FIPS"] = cbp["FIPSTATE"]+cbp["FIPSCTY"]
cbp = cbp.merge(sector, on = "NAICS")

cbp_grouped = cbp.groupby(['FIPS','Sector'])
cbp_grouped

In [None]:
cbp_grouped.dtypes

## Summarize

In [None]:
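grouped_df = (cbp
  .groupby(['FIPS', 'Sector'])
  .sum(numeric_only=True)   # sum numeric columns only, skipping text columns
  .filter(regex='^N')       # keep columns whose names start with N
  .drop(columns=['NAICS'])  # NAICS also starts with N but is a code, not a count
)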

grouped_df.head(5)

In [None]:
print(grouped_df.shape)
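The split-apply-combine pattern used above can be sketched on a made-up table (the `toy` frame and its values are assumptions for illustration): rows are split by county and sector, the count columns are summed within each group, and the results are combined into one row per group. Chaining the methods plays the role that pipes play in other languages.

```python
import pandas as pd

# Hypothetical establishment counts by county, sector, and size class.
toy = pd.DataFrame({
    "FIPS": ["01001", "01001", "01001", "24003"],
    "Sector": ["a", "a", "b", "a"],
    "N1_4": [1, 2, 3, 4],
    "N5_9": [10, 20, 30, 40],
})

summary = (
    toy
    .groupby(["FIPS", "Sector"])[["N1_4", "N5_9"]]  # split
    .sum()                                           # apply
    .reset_index()                                   # combine into a flat table
)
print(summary)
```

Without the final `reset_index`, the grouping columns stay in the row index, which is the form `grouped_df` has in the lesson.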

In [None]:
print(acs.shape)

In [None]:
acs_cbp = grouped_df.merge(acs, on='FIPS')
print(acs_cbp.shape)

In [None]:
acs_cbp.head()