# Hands on - Applied data engineering with Pandas

### ...or creating a simple ETL process

In this hands-on session, we will again work with the data from the ACM case. In the last module, some data scientists had already invested time in data engineering and wrangling.

Given our newly gained pandas skills, we now want to follow their path...


# 1) Importing Files

Import the survey data into pandas. Note that the survey data is spread across three sheets of the data file ("2019", "2020", and "2021"). Load each sheet into its own dataframe.


In [None]:
import pandas as pd

In [None]:
survey2019 = pd.read_excel("https://github.com/casbdai/notebooks2023/raw/main/Module2/DataEngineeringPandas/Pandas_TV_Survey_Data.xlsx", sheet_name="2019")

In [None]:
survey2020 = pd.read_excel("https://github.com/casbdai/notebooks2023/raw/main/Module2/DataEngineeringPandas/Pandas_TV_Survey_Data.xlsx", sheet_name="2020")

In [None]:
survey2021 = pd.read_excel("https://github.com/casbdai/notebooks2023/raw/main/Module2/DataEngineeringPandas/Pandas_TV_Survey_Data.xlsx", sheet_name="2021")

Have a look at the three dataframes. They all have the same structure and identical variable names. Paste them together into a new dataframe.

In [None]:
survey2019.head()

In [None]:
survey2019.info()

In [None]:
survey2020.info()

In [None]:
survey2021.info()

Combine files row-wise or column-wise:

*   set **axis=0** for a row-wise combination (dataframes are stacked on top of each other)
*   set **axis=1** for a column-wise combination (dataframes are placed side by side)
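To make the difference between the two axes concrete, here is a minimal sketch on two tiny, made-up dataframes (the names `a`, `b`, `rows`, and `cols` are purely illustrative and not part of the ACM data). It also uses `ignore_index=True`, an optional argument that renumbers the rows after stacking.

```python
import pandas as pd

# Two toy dataframes with identical columns (hypothetical data)
a = pd.DataFrame({"id": [1, 2], "score": [10, 20]})
b = pd.DataFrame({"id": [3, 4], "score": [30, 40]})

# axis=0: stack the rows of b below the rows of a
rows = pd.concat([a, b], axis=0, ignore_index=True)

# axis=1: place the columns of b next to the columns of a
cols = pd.concat([a, b], axis=1)

print(rows.shape)  # (4, 2)
print(cols.shape)  # (2, 4)
```

Without `ignore_index=True`, the stacked dataframe keeps the original row labels, so the index would contain duplicates (0, 1, 0, 1).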

In [None]:
survey = pd.concat([survey2019, survey2020, survey2021], axis = 0)
survey.info()

Now also read in the intentionality results using an appropriate reading function. Watch out for the delimiter!



In [None]:
intentionality = pd.read_csv("https://raw.githubusercontent.com/casbdai/notebooks2023/main/Module2/DataEngineeringPandas/Pandas_TV_Intentionality_Data.csv", sep=";")
intentionality.info()
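Why the delimiter matters can be seen in a small sketch that reads a semicolon-separated text from memory instead of a file. The content of `raw` is invented for illustration.

```python
import io
import pandas as pd

# A tiny semicolon-separated "file" held in memory (hypothetical content)
raw = "name;value\nA;1\nB;2\n"

# With the default sep=",", every line ends up in one single column
wrong = pd.read_csv(io.StringIO(raw))

# With sep=";", the two columns are split correctly
right = pd.read_csv(io.StringIO(raw), sep=";")

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 2)
```

A dataframe with only one column after reading a CSV is usually a sign that the wrong delimiter was used.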

We need to fix the variable type of "date"

In [None]:
intentionality.date = pd.to_datetime(intentionality.date)
intentionality.info()
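The conversion can be tried out on a toy column first. The sketch below assumes dates written as `YYYY-MM-DD`; the actual format in the intentionality file may differ, in which case the `format` argument has to be adapted or left out.

```python
import pandas as pd

# Toy dataframe with dates stored as text (hypothetical values and format)
toy = pd.DataFrame({"date": ["2019-01-05", "2020-03-17"]})

# Convert the text column to a datetime column
toy["date"] = pd.to_datetime(toy["date"], format="%Y-%m-%d")

# Check that the column is now a datetime type
print(pd.api.types.is_datetime64_any_dtype(toy["date"]))  # True

# Datetime columns give access to date parts via the .dt accessor
print(toy["date"].dt.year.tolist())  # [2019, 2020]
```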

In [None]:
gtrends = pd.read_excel("https://github.com/casbdai/notebooks2023/raw/main/Module2/DataEngineeringPandas/Pandas_TV_GTrends_Data.xlsx")
gtrends.info()

# 2) Merging Files

Now that the data is loaded, we want to combine it into one overarching data set. Be aware that the data needs to be joined on three variables: Industry Ad Type, Program Name, and date / Date Aired.

Perform an inner join of the data.

In [None]:
inner =  pd.merge(survey, intentionality,
                  how="inner",
                  left_on=["IndustryAdType", "ProgramName", "DateAired"],
                  right_on=["IndustryAdType", "ProgramName", "date"])

inner.info()

Perform a left join of the data.

In [None]:
left =  pd.merge(survey, intentionality,
                  how="left",
                  left_on=["IndustryAdType", "ProgramName","DateAired"],
                  right_on=["IndustryAdType", "ProgramName","date"])

left.info()
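The difference between the two join types, and one way to count the missing values a left join introduces, can be sketched on toy data. The dataframes `ads` and `scores` and their values are invented for this illustration. The optional `indicator=True` argument adds a `_merge` column that shows where each row came from.

```python
import pandas as pd

# Toy "survey" and "intentionality" tables (hypothetical data)
ads = pd.DataFrame({
    "ProgramName": ["News", "Sports", "Drama"],
    "DateAired": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02"]),
    "Spend": [5, 7, 3],
})
scores = pd.DataFrame({
    "ProgramName": ["News", "Sports"],
    "date": pd.to_datetime(["2020-01-01", "2020-01-01"]),
    "intentionality": [0.4, 0.6],
})

# The key columns have different names on each side
keys = dict(left_on=["ProgramName", "DateAired"],
            right_on=["ProgramName", "date"])

inner_toy = pd.merge(ads, scores, how="inner", **keys)
left_toy = pd.merge(ads, scores, how="left", indicator=True, **keys)

print(len(inner_toy), len(left_toy))            # 2 3
print(left_toy["intentionality"].isna().sum())  # 1
print(left_toy["_merge"].value_counts())
```

The inner join drops the "Drama" row because it has no match, while the left join keeps it and fills `intentionality` with NaN.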

How many NaNs are introduced in the variable intentionality? (you can use .info() )

Number of NaN: __

Which joining method would you use for combining the two dataframes? Why?

Your answer: __________________________

In [None]:
left =  pd.merge(left, gtrends,
                  how="left",
                  left_on=["IndustryAdType", "DateAired"],
                  right_on=["IndustryAdType", "date"])

left.info()

# 3) Dealing with NA

To practice our skills in dealing with missing data, we have decided to go with the left join, which keeps unmatched survey rows and fills their missing fields with NaN.

Create a new dataframe in which you have removed all missing values:

In [None]:
acmdata = left.dropna()
acmdata.info()

Create a new dataframe in which you insert 0 into the missing data fields of appropriate variables.

In [None]:
acmdata_0 = left.fillna(value=0)
acmdata_0.info()
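Calling `fillna(value=0)` on the whole dataframe replaces missing values in every column, including text columns. If only some variables should receive a 0, a dictionary can be passed instead. The sketch below uses a made-up dataframe, and the choice of "Spend" as the column to fill is only an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy dataframe with missing values in several columns (hypothetical data)
toy = pd.DataFrame({
    "ProgramName": ["News", None, "Drama"],
    "Spend": [5.0, np.nan, 3.0],
    "intentionality": [0.4, 0.6, np.nan],
})

# Fill only the "Spend" column with 0; other columns stay untouched
filled = toy.fillna(value={"Spend": 0})

# Count the remaining missing values per column
print(filled.isna().sum())
```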

# 4) Transforming Variables

In the following exercises, we use the acmdata dataframe!

Rename the variable "Spend" into "Spend_in_000"

In [None]:
acmdata = acmdata.rename(columns={"Spend": "Spend_in_000"})
acmdata.info()

Delete the variables "date_y" and "date_x". Both are leftovers of the "date" join keys.

In [None]:
del acmdata["date_y"]
acmdata.info()

In [None]:
acmdata = acmdata.drop("date_x", axis = 1)
acmdata.info()
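The `_x` and `_y` suffixes appear whenever both dataframes in a merge contain a column with the same name that pandas cannot merge into one. The sketch below reproduces this on two made-up dataframes and removes both columns in a single call to `drop`.

```python
import pandas as pd

# Both toy dataframes carry a non-key column called "date" (hypothetical data)
left_df = pd.DataFrame({"key": [1, 2], "date": ["a", "b"]})
right_df = pd.DataFrame({"key": [1, 2], "date": ["c", "d"]})

# pandas disambiguates the clashing names with the suffixes _x and _y
merged = pd.merge(left_df, right_df, on="key")
print(list(merged.columns))  # ['key', 'date_x', 'date_y']

# Several columns can be dropped at once by passing a list
cleaned = merged.drop(columns=["date_x", "date_y"])
print(list(cleaned.columns))  # ['key']
```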

Aggregate the acmdata data frame by "IndustryAdType" using .mean()

In [None]:
# numeric_only=True skips text columns, which cannot be averaged
acmdata.groupby("IndustryAdType").mean(numeric_only=True)

Aggregate the acmdata dataframe by "IndustryAdType" and "ProgramName" using .sum()

In [None]:
# numeric_only=True restricts the sum to numeric columns
acmdata.groupby(["IndustryAdType", "ProgramName"]).sum(numeric_only=True)

Again, aggregate the acmdata dataframe by "IndustryAdType" and "ProgramName" using .sum(). This time, however, you are only interested in the "Spend_in_000" and "Impressions" data.

In [None]:
acmdata.groupby(["IndustryAdType", "ProgramName"])[["Spend_in_000", "Impressions"]].sum()
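The aggregations above can be checked on a small example where the results are easy to verify by hand. The dataframe below is invented. It also shows `.agg()` with named aggregations, an alternative not used elsewhere in this notebook that applies a different function to each column.

```python
import pandas as pd

# Toy data with the same column names as acmdata (hypothetical values)
toy = pd.DataFrame({
    "IndustryAdType": ["Auto", "Auto", "Food"],
    "ProgramName": ["News", "Sports", "News"],
    "Spend_in_000": [10.0, 20.0, 5.0],
    "Impressions": [100, 300, 50],
})

# Mean of all numeric columns per industry
means = toy.groupby("IndustryAdType").mean(numeric_only=True)
print(means)

# Different aggregation functions per column, with custom result names
summary = toy.groupby("IndustryAdType").agg(
    total_spend=("Spend_in_000", "sum"),
    mean_impressions=("Impressions", "mean"),
)
print(summary)
```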

### Meaningful plots: Combining aggregations and .plot()

For creating more meaningful and Tableau-like plots in python, you have to combine aggregations with the .plot() method

In [None]:
acmdata.groupby(["DateAired"])["Spend_in_000"].sum().plot()

A bar plot of Spend by Program Name:

In [None]:
acmdata.groupby(["ProgramName"])["Spend_in_000"].sum().plot(kind="bar")

# 5) Writing the Data File

Now, write the merged and tidied data to an Excel file.

In [None]:
acmdata.to_excel("acmdata.xlsx", index=False)

In [None]:
from google.colab import files
files.download('acmdata.xlsx')

Or write the data into an SQL database

In [None]:
import sqlalchemy as db

engine = db.create_engine("sqlite:///cleaned_database")
engine.connect()

acmdata.to_sql('clean_acm_data', con=engine, if_exists="replace", index=False)

inspector = db.inspect(engine)
inspector.get_table_names()
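To verify that a table was written correctly, it can be read back into pandas. The sketch below is a simplified variant, not the setup used above: it assumes Python's built-in `sqlite3` module and an in-memory database instead of SQLAlchemy and a database file, and it uses a made-up dataframe.

```python
import sqlite3
import pandas as pd

# Toy dataframe standing in for acmdata (hypothetical values)
toy = pd.DataFrame({
    "ProgramName": ["News", "Drama"],
    "Spend_in_000": [5.0, 3.0],
})

# An in-memory SQLite database that disappears when the connection closes
conn = sqlite3.connect(":memory:")

# Write the table, then read it back with a SQL query
toy.to_sql("clean_acm_data", con=conn, if_exists="replace", index=False)
restored = pd.read_sql("SELECT * FROM clean_acm_data", conn)

conn.close()

print(restored)
```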