# Data Ingestion

* `Pandas DataFrame` 

* `Pandas I/O`

* `Pandas Groupby` 

* `Pandas Merge`

* `Pandas Concat`

---

### `Pandas DataFrame`


> A Pandas DataFrame is a two-dimensional, tabular data structure in the Python library Pandas, often used for data analysis and manipulation.   
> It is similar to a spreadsheet in Excel or a SQL table and provides flexible and powerful ways to handle data.
   
> Here are some of the key features of a Pandas DataFrame:  

> - **Rows and Columns:** A DataFrame consists of rows and columns, where each column has a name (column label) and each row has an index (row label).
> - **Data Integration:** A DataFrame can be loaded from various sources such as CSV files, Excel files, SQL databases, and JSON files.
> - **Data Manipulation:** You can add, remove, or rename columns, and filter, group, aggregate, and merge data.
> - **Flexibility:** Different columns can hold different data types, such as numerical values, strings, and dates.

> For full documentation please see: [pandas.DataFrame API reference](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)

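The features listed above can be sketched on a small table. This is a minimal illustration with made-up data; the names `people`, `age_next_year`, `join_date`, and `older_than_23` are assumptions chosen for this example and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
people = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [24, 27, 22],
        "joined": pd.to_datetime(["2021-03-01", "2020-07-15", "2022-01-10"]),
    }
)

# Flexibility: each column keeps its own data type
print(people.dtypes)

# Data manipulation: add a derived column, rename a column, filter rows
people["age_next_year"] = people["age"] + 1
people = people.rename(columns={"joined": "join_date"})
older_than_23 = people[people["age"] > 23]
print(older_than_23)
```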
Import Pandas

In [238]:
import pandas as pd

Create <u>empty</u> Pandas DataFrame

In [239]:
df = pd.DataFrame()

In [240]:
display(df)

In [241]:
print(type(df))

<class 'pandas.core.frame.DataFrame'>


Create <u>empty</u> Pandas DataFrame <u>with row and column labels</u>

In [242]:
df = pd.DataFrame(index=[1, 2, 3], columns=["Column1", "Column2", "Column3"])

In [243]:
display(df)

Unnamed: 0,Column1,Column2,Column3
1,,,
2,,,
3,,,


In [244]:
print(type(df))

<class 'pandas.core.frame.DataFrame'>


Create DataFrame <u>with example data</u>

In [245]:
example_data = {
  "Name": {
    "0": "Alice",
    "1": "Bob",
    "2": "Charlie",
    "3": "David",
    "4": "Eva"
  },
  "Age": {
    "0": 24,
    "1": 27,
    "2": 22,
    "3": 32,
    "4": 29
  },
  "City": {
    "0": "New York",
    "1": "Los Angeles",
    "2": "Chicago",
    "3": "Houston",
    "4": "Phoenix"
  }
}

In [246]:
df = pd.DataFrame(example_data)

In [247]:
display(df)

Unnamed: 0,Name,Age,City
0,Alice,24,New York
1,Bob,27,Los Angeles
2,Charlie,22,Chicago
3,David,32,Houston
4,Eva,29,Phoenix

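One detail of the nested dictionary above: the inner keys (`"0"`, `"1"`, ...) are strings, so pandas uses them as string row labels rather than integers. The sketch below uses a shortened, hypothetical copy of the data (`nested` and `people` are illustrative names) to show this, and one possible way to convert the labels.

```python
import pandas as pd

# Shortened copy of the nested dictionary, for illustration only
nested = {"Name": {"0": "Alice", "1": "Bob"}, "Age": {"0": 24, "1": 27}}
people = pd.DataFrame(nested)

# The outer keys become columns, the inner keys become row labels (strings here)
labels_before = people.index.tolist()
print(labels_before)

# Optional: convert the row labels to integers so that .loc[0] works
people.index = people.index.astype(int)
print(people.loc[0, "Name"])
```

With string labels, `people.loc[0]` would raise a `KeyError`, while `people.loc["0"]` or the positional `people.iloc[0]` would work.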

### `Pandas I/O`

> The pandas I/O API is a set of top-level `reader` functions, accessed like `pandas.read_csv()`, that generally return a pandas object.
> The corresponding `writer` functions are object methods, accessed like `DataFrame.to_csv()`.

> The table below lists a few of the available `readers` and `writers`.

> For full documentation please see: [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/io.html)

| **Format** | **Reader** | **Writer** |
|----------|----------|----------|
| XLSX | read_excel | to_excel |
| CSV | read_csv | to_csv |
| JSON | read_json | to_json | 

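The reader/writer pairing can be tried without touching the file system. The following is a minimal sketch with a hypothetical two-row table (`people`, `csv_text`, and `restored` are illustrative names): the writer is a method on the DataFrame, the reader is a top-level function that returns a new DataFrame.

```python
import io

import pandas as pd

# Hypothetical toy data, for illustration only
people = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [24, 27]})

# Writer: a DataFrame method; without a path, to_csv returns the CSV text
csv_text = people.to_csv(index=False)

# Reader: a top-level function; StringIO stands in for a file on disk
restored = pd.read_csv(io.StringIO(csv_text))

# The round trip should give back an equal DataFrame
print(restored.equals(people))
```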

`XLSX` Files

In [248]:
#%pip install openpyxl

In [249]:
import openpyxl

In [250]:
# Path
xlsx_file_path = "data/example.xlsx"

In [251]:
# Writer
df.to_excel(xlsx_file_path, index=False)

In [252]:
# Reader
xlsx_df = pd.read_excel(xlsx_file_path)

In [253]:
display(xlsx_df)

Unnamed: 0,Name,Age,City
0,Alice,24,New York
1,Bob,27,Los Angeles
2,Charlie,22,Chicago
3,David,32,Houston
4,Eva,29,Phoenix


In [254]:
print(type(xlsx_df))

<class 'pandas.core.frame.DataFrame'>


`CSV` Files 

In [255]:
# Path
csv_file_path = "data/example.csv"

In [256]:
# Writer
df.to_csv(csv_file_path, index=False)

In [257]:
# Reader
csv_df = pd.read_csv(csv_file_path)

In [258]:
display(csv_df)

Unnamed: 0,Name,Age,City
0,Alice,24,New York
1,Bob,27,Los Angeles
2,Charlie,22,Chicago
3,David,32,Houston
4,Eva,29,Phoenix


In [259]:
print(type(csv_df))

<class 'pandas.core.frame.DataFrame'>


`JSON` Files

In [260]:
# Path
json_file_path = "data/example.json"

In [261]:
# Writer
df.to_json(json_file_path, orient="columns")

In [262]:
# Reader
json_df = pd.read_json(json_file_path)

In [263]:
display(json_df)

Unnamed: 0,Name,Age,City
0,Alice,24,New York
1,Bob,27,Los Angeles
2,Charlie,22,Chicago
3,David,32,Houston
4,Eva,29,Phoenix


In [264]:
print(type(json_df))

<class 'pandas.core.frame.DataFrame'>


In [265]:
# TODO: I/O with the Sakila MySQL database

### `Pandas Groupby`

> A groupby operation involves some combination of splitting the object, applying a function, and combining the results.   
> This can be used to group large amounts of data and compute operations on these groups.

> For full documentation please see: [pandas.DataFrame.groupby API reference](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)

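The split, apply, and combine steps can be written out by hand to see what `groupby` does. This is a minimal sketch with hypothetical data (`orders`, `manual`, and `grouped` are illustrative names), not a description of how pandas implements it internally.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
orders = pd.DataFrame(
    {"product": ["A", "B", "A", "B"], "sales": [10, 20, 30, 40]}
)

# Manual version: split by key, apply a function, combine the results
manual = {}
for product, group in orders.groupby("product"):
    manual[product] = group["sales"].sum()
manual = pd.Series(manual)

# Equivalent one-liner
grouped = orders.groupby("product")["sales"].sum()

print(grouped)
```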
**Example Data**

Create DataFrame with products and dates

In [266]:
# Products
products = ["A", "B", "C"]

# Dates ranging from 2023-01-01 to 2023-12-31
dates = pd.date_range(start='2023-01-01', end='2023-12-31')

# Cartesian product
df = pd.DataFrame([(product, date) for product in products for date in dates],
                  columns=['product', 'date'])

In [267]:
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns

print(f"df contains {len(products)} x {len(dates)} = {len(df)} rows")
display(df.head(5))
display(df.tail(5))

df contains 3 x 365 = 1095 rows


Unnamed: 0,product,date
0,A,2023-01-01
1,A,2023-01-02
2,A,2023-01-03
3,A,2023-01-04
4,A,2023-01-05


Unnamed: 0,product,date
1090,C,2023-12-27
1091,C,2023-12-28
1092,C,2023-12-29
1093,C,2023-12-30
1094,C,2023-12-31


Extract quarter, month, week and weekday from dates

In [268]:
# Quarter
df["quarter"] = df["date"].dt.quarter
# Month
df["month"] = df["date"].dt.month
# Week
df["week"] = df["date"].dt.isocalendar().week
# Weekday
df["weekday"] = df["date"].dt.weekday

In [269]:
display(df.head(5))
display(df.tail(5))

Unnamed: 0,product,date,quarter,month,week,weekday
0,A,2023-01-01,1,1,52,6
1,A,2023-01-02,1,1,1,0
2,A,2023-01-03,1,1,1,1
3,A,2023-01-04,1,1,1,2
4,A,2023-01-05,1,1,1,3


Unnamed: 0,product,date,quarter,month,week,weekday
1090,C,2023-12-27,4,12,52,2
1091,C,2023-12-28,4,12,52,3
1092,C,2023-12-29,4,12,52,4
1093,C,2023-12-30,4,12,52,5
1094,C,2023-12-31,4,12,52,6

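The first row above shows week 52 for 2023-01-01. That is expected rather than a bug: `isocalendar()` follows ISO 8601 week numbering, under which 1 January 2023 (a Sunday) still belongs to the last week of ISO year 2022. A minimal check, using a hypothetical two-element series `dates`:

```python
import pandas as pd

# Two consecutive days around the ISO week boundary
dates = pd.Series(pd.to_datetime(["2023-01-01", "2023-01-02"]))

# isocalendar() returns a DataFrame with the columns year, week, and day
iso = dates.dt.isocalendar()
print(iso)
```

The ISO `year` column differs from the calendar year for the first date, which is worth keeping in mind when grouping by `week` alone: days from early January and late December can end up in the same group.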

Add column with random sales

In [270]:
import numpy as np

# Random seed for reproducibility
np.random.seed(42)

# Random sales
df["sales"] = np.random.randint(10000, 50000, size=len(df))

In [271]:
display(df.head(5))
display(df.tail(5))

Unnamed: 0,product,date,quarter,month,week,weekday,sales
0,A,2023-01-01,1,1,52,6,25795
1,A,2023-01-02,1,1,1,0,10860
2,A,2023-01-03,1,1,1,1,48158
3,A,2023-01-04,1,1,1,2,21284
4,A,2023-01-05,1,1,1,3,16265


Unnamed: 0,product,date,quarter,month,week,weekday,sales
1090,C,2023-12-27,4,12,52,2,28922
1091,C,2023-12-28,4,12,52,3,22175
1092,C,2023-12-29,4,12,52,4,31389
1093,C,2023-12-30,4,12,52,5,10822
1094,C,2023-12-31,4,12,52,6,27327


Add missing values

In [289]:
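# Number of random draws; np.random.choice samples with replacement by default,
# so the number of distinct rows set to NaN can be slightly lower
num_nan = 95

# Random seed for reproducibility
np.random.seed(42)

# Random indices to set as NaN
nan_indices = np.random.choice(df.index, size=num_nan)

# Replace with NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
df.loc[nan_indices, "sales"] = np.nan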

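Because `np.random.choice` draws with replacement unless told otherwise, the same index can be picked more than once, so fewer than 95 distinct rows may end up as NaN (the weekday counts later in this notebook add up to 1002 non-missing values out of 1095). The sketch below compares both modes on a plain range of the same length; `n_rows`, `with_replacement`, and `without_replacement` are illustrative names, and passing `replace=False` is one possible way to get exactly 95 distinct indices.

```python
import numpy as np

n_rows = 1095  # same number of rows as df above

# Default behaviour: sampling with replacement, duplicates are possible
np.random.seed(42)
with_replacement = np.random.choice(n_rows, size=95)

# replace=False guarantees that every drawn index is distinct
np.random.seed(42)
without_replacement = np.random.choice(n_rows, size=95, replace=False)

print(len(np.unique(with_replacement)))     # at most 95
print(len(np.unique(without_replacement)))  # exactly 95
```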
In [290]:
display(df.head(20))

Unnamed: 0,product,date,quarter,month,week,weekday,sales
0,A,2023-01-01,1,1,52,6,25795.0
1,A,2023-01-02,1,1,1,0,10860.0
2,A,2023-01-03,1,1,1,1,48158.0
3,A,2023-01-04,1,1,1,2,21284.0
4,A,2023-01-05,1,1,1,3,16265.0
5,A,2023-01-06,1,1,1,4,26850.0
6,A,2023-01-07,1,1,1,5,47194.0
7,A,2023-01-08,1,1,1,6,31962.0
8,A,2023-01-09,1,1,2,0,26023.0
9,A,2023-01-10,1,1,2,1,11685.0


**`groupby` operations**

<u>Single</u> Aggregations

In [300]:
# Sum
df_sum = df.groupby(by=["product"], as_index=False)["sales"].sum()

In [301]:
display(df_sum)

Unnamed: 0,product,sales
0,A,9393995.0
1,B,10229745.0
2,C,9597076.0


In [302]:
# Mean
df_mean = df.groupby(by=["product"], as_index=False)["sales"].mean()

In [303]:
display(df_mean)

Unnamed: 0,product,sales
0,A,28125.733533
1,B,30445.669643
2,C,28906.855422


In [311]:
# Median
df_median = df.groupby(by=["product"], as_index=False)["sales"].median()

In [312]:
display(df_median)

Unnamed: 0,product,sales
0,A,27872.5
1,B,31292.5
2,C,28009.0


In [338]:
# Minimum
df_min = df.groupby(by=["quarter"], as_index=False)["sales"].min()

# Maximum
df_max = df.groupby(by=["quarter"], as_index=False)["sales"].max()

In [339]:
display(df_min)

Unnamed: 0,quarter,sales
0,1,10055.0
1,2,10009.0
2,3,10009.0
3,4,10190.0


In [340]:
display(df_max)

Unnamed: 0,quarter,sales
0,1,49976.0
1,2,49954.0
2,3,49964.0
3,4,49986.0


In [341]:
# Size (missing values included)
df_size = df.groupby(by=["month"], as_index=False)["sales"].size()

# Count (missing values excluded)
df_count = df.groupby(by=["month"], as_index=False)["sales"].count()

In [343]:
display(df_size)

Unnamed: 0,month,size
0,1,93
1,2,84
2,3,93
3,4,90
4,5,93
5,6,90
6,7,93
7,8,93
8,9,90
9,10,93


In [345]:
display(df_count)

Unnamed: 0,month,sales
0,1,83
1,2,74
2,3,88
3,4,83
4,5,82
5,6,86
6,7,84
7,8,86
8,9,86
9,10,84

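The difference between `size` and `count` gives the number of missing values per group. A minimal sketch on a hypothetical five-row table (`toy` and `per_month` are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing sales, for illustration only
toy = pd.DataFrame(
    {
        "month": [1, 1, 2, 2, 2],
        "sales": [10.0, np.nan, 5.0, np.nan, np.nan],
    }
)

# size counts all rows per group, count only the non-missing values
per_month = toy.groupby("month")["sales"].agg(["size", "count"])
per_month["missing"] = per_month["size"] - per_month["count"]
print(per_month)
```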

In [347]:
# Unique values count
df_nunique = df.groupby(by=["product"], as_index=False)["week"].nunique()

In [325]:
display(df_nunique)

Unnamed: 0,product,week
0,A,52
1,B,52
2,C,52


<u>Multiple</u> Aggregations

In [348]:
df_agg = df.groupby(by=["weekday"], as_index=False).agg(
    {"sales": ["count", "min", "mean", "median", "max"]}
)

In [349]:
display(df_agg)

Unnamed: 0_level_0,weekday,sales,sales,sales,sales,sales
Unnamed: 0_level_1,Unnamed: 1_level_1,count,min,mean,median,max
0,0,138,10161.0,29479.304348,29290.0,49976.0
1,1,152,10117.0,30121.203947,30115.0,49954.0
2,2,141,10189.0,29449.134752,29758.0,49964.0
3,3,143,10009.0,28763.776224,27449.0,49811.0
4,4,143,10404.0,27410.34965,26389.0,49649.0
5,5,140,10009.0,29484.764286,31338.0,49764.0
6,6,145,10077.0,29387.268966,29816.0,49986.0

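Passing a dictionary of lists to `agg` produces two header levels, as in the output above. If flat column names are preferred, named aggregation is one alternative. A minimal sketch on hypothetical data (`toy`, `flat`, and the output column names are assumptions for this example):

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame(
    {"weekday": [0, 0, 1, 1], "sales": [10.0, 30.0, 20.0, 60.0]}
)

# Named aggregation: new_column_name=(source_column, aggregation)
flat = toy.groupby("weekday", as_index=False).agg(
    sales_min=("sales", "min"),
    sales_mean=("sales", "mean"),
    sales_max=("sales", "max"),
)

print(flat.columns.tolist())
print(flat)
```

Each result column gets exactly the name given as keyword, so no second header level is created.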

<u>Custom</u> Aggregations

In [350]:
# Custom function
def range_func(x):
    return x.max() - x.min()

In [351]:
df_custom_agg = df.groupby(by=["weekday"], as_index=False).agg(
    {"sales": ["count", "min", "mean", "median", "max", range_func]}
)

In [352]:
display(df_custom_agg)

Unnamed: 0_level_0,weekday,sales,sales,sales,sales,sales,sales
Unnamed: 0_level_1,Unnamed: 1_level_1,count,min,mean,median,max,range_func
0,0,138,10161.0,29479.304348,29290.0,49976.0,39815.0
1,1,152,10117.0,30121.203947,30115.0,49954.0,39837.0
2,2,141,10189.0,29449.134752,29758.0,49964.0,39775.0
3,3,143,10009.0,28763.776224,27449.0,49811.0,39802.0
4,4,143,10404.0,27410.34965,26389.0,49649.0,39245.0
5,5,140,10009.0,29484.764286,31338.0,49764.0,39755.0
6,6,145,10077.0,29387.268966,29816.0,49986.0,39909.0


Aggregation with <u>multiple group columns</u>

In [353]:
# Multiple group columns
df_agg_multiple = df.groupby(by=["product", "weekday"], as_index=False).agg(
    {"sales": ["count", "min", "mean", "median", "max", range_func]}
)

In [354]:
display(df_agg_multiple)

Unnamed: 0_level_0,product,weekday,sales,sales,sales,sales,sales,sales
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,count,min,mean,median,max,range_func
0,A,0,46,10161.0,28623.130435,27999.5,49976.0,39815.0
1,A,1,50,10206.0,27076.52,24290.0,49567.0,39361.0
2,A,2,48,10189.0,27325.875,26182.5,48754.0,38565.0
3,A,3,50,10190.0,27957.2,27411.0,49353.0,39163.0
4,A,4,48,10663.0,28585.041667,27420.5,48660.0,37997.0
5,A,5,47,10569.0,29526.042553,30932.0,49734.0,39165.0
6,A,6,45,10197.0,27871.044444,28141.0,49974.0,39777.0
7,B,0,45,10698.0,30056.933333,31732.0,47497.0,36799.0
8,B,1,51,11542.0,32448.254902,33289.0,49954.0,38412.0
9,B,2,47,11324.0,30468.0,30491.0,48649.0,37325.0


<u>Rolling</u> Aggregates

In [379]:
df_rolling_agg = df.copy()

In [384]:
# Rolling sum, mean, and standard deviation over a 3-row window
# (computed over all rows in order, not separately per product)
result = df["sales"].rolling(window=3).agg(["sum", "mean", "std"])

df_rolling_agg[["sales_roll_sum_3", "sales_roll_mean_3", "sales_roll_std_3"]] = result

In [386]:
display(df_rolling_agg.head(10))

Unnamed: 0,product,date,quarter,month,week,weekday,sales,sales_roll_sum_3,sales_roll_mean_3,sales_roll_std_3
0,A,2023-01-01,1,1,52,6,25795.0,,,
1,A,2023-01-02,1,1,1,0,10860.0,,,
2,A,2023-01-03,1,1,1,1,48158.0,84813.0,28271.0,18771.870791
3,A,2023-01-04,1,1,1,2,21284.0,80302.0,26767.333333,19244.100637
4,A,2023-01-05,1,1,1,3,16265.0,85707.0,28569.0,17149.177269
5,A,2023-01-06,1,1,1,4,26850.0,64399.0,21466.333333,5294.855081
6,A,2023-01-07,1,1,1,5,47194.0,90309.0,30103.0,15719.009733
7,A,2023-01-08,1,1,1,6,31962.0,106006.0,35335.333333,10583.199768
8,A,2023-01-09,1,1,2,0,26023.0,105179.0,35059.666667,10920.140307
9,A,2023-01-10,1,1,2,1,11685.0,69670.0,23223.333333,10424.384027

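The rolling window above runs over all rows in order, so the first windows of product B also contain the last rows of product A. If the window should restart for each product, one option is to combine `groupby` with `transform`. A minimal sketch on hypothetical data (`toy` and the two result column names are assumptions for this example):

```python
import pandas as pd

# Hypothetical toy data: two products, three days each
toy = pd.DataFrame(
    {
        "product": ["A", "A", "A", "B", "B", "B"],
        "sales": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    }
)

# Window over all rows: the first B window also contains the last A row
toy["roll_sum_2_global"] = toy["sales"].rolling(window=2).sum()

# Window restarted for each product
toy["roll_sum_2_per_product"] = toy.groupby("product")["sales"].transform(
    lambda s: s.rolling(window=2).sum()
)

print(toy)
```

In the global column the first row of product B holds 3.0 + 10.0 = 13.0, while the per-product column is NaN there because the window has not filled up yet.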

### `Pandas Merge`

`Left Join`

`Inner Join`

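A minimal sketch of both join types with `DataFrame.merge`. The tables `customers` and `orders` and their columns are hypothetical and only serve as illustration. A left join keeps every row of the left table and fills unmatched columns with NaN; an inner join keeps only keys present in both tables.

```python
import pandas as pd

# Hypothetical toy tables, for illustration only
customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]}
)
orders = pd.DataFrame(
    {"customer_id": [1, 1, 3, 4], "amount": [50, 20, 70, 10]}
)

# Left join: all customers are kept, Bob has no order so his amount is NaN
left = customers.merge(orders, on="customer_id", how="left")

# Inner join: only customers with at least one order, and only orders
# that belong to a known customer
inner = customers.merge(orders, on="customer_id", how="inner")

print(left)
print(inner)
```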
### `Pandas Concat`
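
A minimal sketch of `pd.concat`, assuming two small hypothetical tables `q1` and `q2` with the same columns. With the default `axis=0` the tables are stacked row-wise; with `axis=1` they are placed side by side and aligned on the row index.

```python
import pandas as pd

# Hypothetical toy tables, for illustration only
q1 = pd.DataFrame({"product": ["A", "B"], "sales": [100, 200]})
q2 = pd.DataFrame({"product": ["A", "C"], "sales": [150, 50]})

# Row-wise: stack the tables and renumber the index from 0
stacked = pd.concat([q1, q2], ignore_index=True)

# Column-wise: align on the row index; the suffix avoids duplicate column names
side_by_side = pd.concat([q1, q2.add_suffix("_q2")], axis=1)

print(stacked)
print(side_by_side)
```

Unlike `merge`, `concat` does not match rows on key columns; it only aligns on the index of the chosen axis.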