# Chapter 1

### EDA

- The process of:
    - reviewing and cleaning data
    - deriving insights (descriptive statistics, correlation)
    - generating hypotheses for experiments
    - exploring grouped distributions
    - Visualize numeric distributions (histogram) and categorical distributions (bar plot)
        - Use kdeplot to compare subgroup distributions in a single plot; set `cut=0` so the curves do not extend beyond the observed data range
    - Handle outliers
        - Inspect the median, quartiles, minimum, and maximum with a boxplot to spot outliers
        - Filter the data to drop outliers
    - Handle missing data:
        - Drop rows with missing values in a column if 5% or less of that column is missing
        - Impute missing values otherwise (mean, median, or mode)
        - Impute by subgroup, since typical values may differ between subgroups
    - Handle diverse categories:
        - A column may contain so many distinct categories that they need to be generalized
        - e.g. the jobs "NLP Engineer" and "Text Analyst" may both fall under "Machine Learning Engineer"
    - Convert numeric data according to desired standard
        - eg: EUR to USD value, or standardization
    - Visualize correlation with scatterplot (or use pairplot)
        - The correlation coefficient only captures linear relationships
        - Correlation may be close to 0 while the scatterplot shows a strong non-linear relationship
        - Correlation may be high while the scatterplot shows that the relationship is actually curved (e.g. quadratic)
    - Check if the sample accurately represents the population
        - Check class imbalance with `value_counts`
        - Check combinations of classes with `crosstab`, both for counts and for the median of a numeric column
    - Create necessary columns for further analysis
        - From category or string column to numeric column
        - From range of value to categorical by cutting into bins
        - From date column to more granular representation (weekday, month etc)
    - Generate hypotheses and look for conclusive evidence
        - Gather enough data and check correlations
        - Do all correlations make sense? Check them with hypothesis testing
        - A hypothesis test indicates whether an observed relationship is statistically significant, i.e. unlikely to be due to chance alone; it does not prove that the relationship is true or causal
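
The point about correlation capturing only linear relationships can be illustrated with a small sketch. The data below is made up for illustration (the names `demo`, `y_quadratic`, and `y_linear` are not from these notes): `y_quadratic` is fully determined by `x`, yet its Pearson correlation with `x` is essentially zero because the relationship is symmetric rather than linear.

```python
import numpy as np
import pandas as pd

# Hypothetical data: both y columns are exact functions of x
x = np.linspace(-3, 3, 101)  # symmetric around 0
demo = pd.DataFrame({
    "x": x,
    "y_quadratic": x ** 2,   # strong but non-linear dependence
    "y_linear": 2 * x + 1,   # perfectly linear dependence
})

# Series.corr computes the Pearson correlation by default
r_quadratic = demo["x"].corr(demo["y_quadratic"])
r_linear = demo["x"].corr(demo["y_linear"])

print(f"x vs x^2:    r = {r_quadratic:.3f}")
print(f"x vs 2x + 1: r = {r_linear:.3f}")
```

A scatterplot of `x` against `y_quadratic` would show a clear parabola, which is why plotting the data alongside the coefficient is worthwhile.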

### Pandas groupby

```
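# Group "another_col" column by "col1" and "col2" and
# produce min, max and sum of the grouped data
# (pass function names as strings; passing the builtins min/max/sum is deprecated in recent pandas)
df.groupby(["col1","col2"])["another_col"].agg(["min", "max", "sum"])
# Way 2: dictionary mapping each column to its aggregations
df.groupby("cat_col").agg({"col1": ["mean", "std"], "col2": ["median"]})
# Way 3: named aggregation, new_column_name=(source_column, function)
df.groupby("some_col").agg(
    mean_col1=("col1", "mean"),
    std_col2=("col2", "std"),
    median_col3=("col3", "median")
)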
# Multi-index groupby
df.groupby(level=0).agg({'col':'mean'}) # Outermost = level 0
# Size per group
df.groupby('col').size()
# Add a subgroup aggregate to every row of the dataframe (transform keeps the original shape)
df["std_dev"] = df.groupby("cat_col")["num_col"].transform("std")
```
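
As a minimal sketch of the difference between `agg` and `transform`, the example below uses a small made-up dataframe (`sales`, `region`, and `amount` are placeholder names, not from these notes). `agg` collapses each group to one row, while `transform` returns one value per original row.

```python
import pandas as pd

# Hypothetical data; names are placeholders for illustration
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "amount": [10.0, 20.0, 5.0, 15.0, 40.0],
})

# Named aggregation: the result has one row per group
summary = sales.groupby("region").agg(
    mean_amount=("amount", "mean"),
    max_amount=("amount", "max"),
)

# transform: the result is aligned with the original rows,
# so it can be assigned straight back as a new column
sales["region_mean"] = sales.groupby("region")["amount"].transform("mean")

print(summary)
print(sales)
```

The transformed column is handy for row-level comparisons against the group, for example `sales["amount"] - sales["region_mean"]`.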

### Common EDA procedures

```
import numpy as np
import pandas as pd
import seaborn as sns
# See sample data
df.head()
# See dataframe info
df.info()
# See info about columns
df.dtypes
# See specified data type columns
df.select_dtypes("number").head()
# Convert to desired data type
df["col"] = df["col"].astype(int)
# See categories, label frequencies and value distribution
df["cat_col"].value_counts(normalize=True) # Check class imbalance
pd.crosstab(df["cat_col1"], df["cat_col2"]) # Check distribution of class combinations
pd.crosstab(df["cat_col1"], df["cat_col2"], values=df["num_col"], aggfunc="median") # Check median values per class combination
# Validating categorical columns
df["cat_col"].isin(["Cat 1", "Cat 2"])
# See numerical statistics
df.describe()
# Visualize distribution of numeric data
sns.histplot(data=df, x="col", binwidth=.1)
sns.boxplot(data=df, x="num_col", y="cat_col") # Also reveals outliers per category
# Visualize categorical data
sns.countplot(data=df, x="cat_col") # Frequency of each category
sns.barplot(data=df, x="cat_col", y="num_col") # Mean of num_col per category (with confidence interval)
# Explore group distribution
df.groupby("cat_col").agg({"col1": ["mean", "std"], "col2": ["median"]})
# Generalize categorical data: each row gets generalized_category_list[i] for the first condition_list[i] that is True
df["Generalized_Category"] = np.select(condition_list, generalized_category_list, default="Other")
# Filtering outliers
seventy_fifth = df["num_col"].quantile(0.75)
twenty_fifth = df["num_col"].quantile(0.25)
iqr = seventy_fifth - twenty_fifth
upper = seventy_fifth + (1.5 * iqr)
lower = twenty_fifth - (1.5 * iqr)
df_without_outliers = df[(df["num_col"] > lower) & (df["num_col"] < upper)]
# Create new categorical columns
twenty_fifth = df["num_col"].quantile(0.25)
median = df["num_col"].median()
seventy_fifth = df["num_col"].quantile(0.75)
maximum = df["num_col"].max()
labels = ["A", "B", "C", "D"]
# Bin edges start at 0, which assumes num_col is strictly positive
bins = [0, twenty_fifth, median, seventy_fifth, maximum]
df["Price_Category"] = pd.cut(df["num_col"], labels=labels, bins=bins)
```
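
The `np.select` line above relies on a `condition_list` and a `generalized_category_list` that these notes do not define. One possible way to build them is sketched below, assuming the categories are matched by keyword with `str.contains`; the dataframe `jobs`, the patterns, and the "Data Analytics" label are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical job titles
jobs = pd.DataFrame({
    "job_title": ["NLP Engineer", "Text Analyst", "Data Analyst", "BI Analyst", "Chef"]
})

# Hypothetical keyword patterns (regular expressions, "|" means "or")
ml_pattern = "NLP|Text|Machine Learning"
analytics_pattern = "Data Analyst|BI"

# One boolean Series per generalized category, in priority order
condition_list = [
    jobs["job_title"].str.contains(ml_pattern),
    jobs["job_title"].str.contains(analytics_pattern),
]
generalized_category_list = ["Machine Learning Engineer", "Data Analytics"]

# Rows matching no condition fall back to the default
jobs["Generalized_Category"] = np.select(
    condition_list, generalized_category_list, default="Other"
)
print(jobs)
```

Because the first matching condition wins, the order of `condition_list` matters when a title could match more than one pattern.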

# Chapter 2

### Handle missing data

```
# Check missing data
df.isna().any()
df.isna().sum()
# Visualize missing data information
import missingno as msno
import matplotlib.pyplot as plt
msno.matrix(df)
plt.show()

# Drop rows where a specific column is missing
df_dropped = df.dropna(subset=["col"])
df.dropna(axis=0) # Drop every row that contains a missing value (default)
df.dropna(axis=1) # Drop every column that contains a missing value

# Replace/impute missing data with single value
import numpy as np
col_mean = df["col"].mean()
df_imputed = df.fillna({"col": col_mean})
df["col"] = df["col"].replace(np.nan, col_mean) # Alternative
# Replace/impute missing data with series
series_imp = df['col1'] * 5
df_imputed = df.fillna({'col2':series_imp})

df["col"].value_counts() # Look out for suspicious values

##### Strategic dropping example ########
# Drop rows with missing values in columns where <= 5% of values are missing; impute the remaining columns
threshold = len(df) * 0.05
cols_to_drop = df.columns[df.isna().sum() <= threshold]
df.dropna(subset=cols_to_drop, inplace=True)
cols_with_missing_values = df.columns[df.isna().sum() > 0]
# Impute with the mode; [:-1] skips the last such column (assumed here to be "num_col"), which is imputed by subgroup below
for col in cols_with_missing_values[:-1]:
    df[col] = df[col].fillna(df[col].mode()[0])
# Impute by subgroup: fill num_col with the median of its cat_col group
subgroup_dict = df.groupby("cat_col")["num_col"].median().to_dict()
df["num_col"] = df["num_col"].fillna(df["cat_col"].map(subgroup_dict))
```
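
To see why imputing by subgroup can matter, here is a minimal sketch on made-up data (`staff`, `level`, and `salary` are placeholder names). It compares filling with the overall median against filling with each group's own median.

```python
import numpy as np
import pandas as pd

# Hypothetical data: salaries differ a lot between levels
staff = pd.DataFrame({
    "level": ["junior", "junior", "junior", "senior", "senior", "senior"],
    "salary": [40.0, np.nan, 60.0, 100.0, 120.0, np.nan],
})

# Option 1: a single value for every missing entry
overall_median = staff["salary"].median()
staff["salary_overall"] = staff["salary"].fillna(overall_median)

# Option 2: each missing entry gets the median of its own subgroup
group_median = staff.groupby("level")["salary"].median().to_dict()
staff["salary_by_group"] = staff["salary"].fillna(staff["level"].map(group_median))

print(staff)
```

With the overall median, the missing junior and senior salaries receive the same value; the subgroup version keeps them in line with their respective groups.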

# Chapter 3

### Date in pandas

```
# Way 1 : During import
df = pd.read_csv('filename.csv', parse_dates = ['date_col1', 'date_col2'])
# Way 2 : Parsing with an explicit format string
df["date_col"] = pd.to_datetime(df["date_col"],
                                format = "%Y-%m-%d %H:%M:%S",
                                errors='coerce') # Unparseable values become NaT
# Format datetimes as strings (the result is a string column, not a datetime column)
df["date_str"] = df["date_col"].dt.strftime("%d-%m-%Y")
# Extract month information
df["date_col"].dt.month
# Extract year information
df["date_col"].dt.year

# Resampling date: monthly mean ('M' = month end; pandas >= 2.2 prefers the alias 'ME')
df.resample('M', on = 'date_col')['col1'].mean()
# Resampling count: number of rows per month
df.resample('M', on = 'date_col').size()
# Attach a timezone to a naive datetime column (ambiguous times, e.g. at DST changes, become NaT)
df['date_col'] = df['date_col'].dt.tz_localize('America/New_York', ambiguous='NaT')
# Convert an already timezone-aware column to another timezone
df['date_col'] = df['date_col'].dt.tz_convert('Europe/London')
```
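
The parsing, extraction, and resampling steps can be chained as in the sketch below. The dataframe `raw` and its values are invented for illustration, and the `"MS"` (month start) alias is used instead of `'M'` only because it behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical raw data with one unparseable date string
raw = pd.DataFrame({
    "date_col": ["2023-01-05 10:00:00", "2023-01-20 12:30:00",
                 "2023-02-03 09:15:00", "not a date"],
    "col1": [10.0, 20.0, 30.0, 40.0],
})

# errors="coerce" turns strings that do not match the format into NaT
raw["date_col"] = pd.to_datetime(raw["date_col"],
                                 format="%Y-%m-%d %H:%M:%S",
                                 errors="coerce")
n_unparsed = raw["date_col"].isna().sum()

# Derive more granular columns from the datetime
raw["month"] = raw["date_col"].dt.month
raw["weekday"] = raw["date_col"].dt.day_name()

# Drop rows without a valid date, then average col1 per calendar month
monthly_mean = (
    raw.dropna(subset=["date_col"])
       .resample("MS", on="date_col")["col1"]
       .mean()
)
print(n_unparsed)
print(monthly_mean)
```

Counting the `NaT` values right after parsing is a cheap way to find out how many rows failed to match the expected format.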

### Date in python

```
from datetime import date
from datetime import datetime
# Create date
d = date(2017, 6, 21) # Arguments in order: year, month, day
# Create a datetime
dt = datetime(year=2017, month=10, day=1, hour=15, minute=23, second=25, microsecond=500000)
# Change value of existing datetime
dt_changed = dt.replace(minute=0, second=0, microsecond=0)
# Sort dates (date_list is a list of date objects)
dates_ordered = sorted(date_list)
# Parse datetime
dt = datetime.strptime("12/30/2017 15:19:13", "%m/%d/%Y %H:%M:%S")
d.isoformat() # Express the date in ISO 8601 format
print(d.strftime("%Y/%m/%d")) # Print date in Format: YYYY/MM/DD
print(dt.strftime("%Y-%m-%d %H:%M:%S")) # Print datetime in specific format
# Extract information
d.year # Extract year
d.month # Extract month
d.day # Extract day
d.weekday() # Extract weekday (Monday = 0 ... Sunday = 6)
##### Date addition / subtraction
from datetime import timedelta
d2 = date(2017, 12, 30) # A second, later date (example value)
delta = d2 - d # Subtract two dates
delta.days # Elapsed time in days
delta.total_seconds() # Elapsed time in seconds
td = timedelta(days=29) # Create a 29 day timedelta
print(d + td) # Add the timedelta to an existing date
# timestamp value
ts = 1514665153.0
# Convert to datetime from timestamp and print (uses the local timezone)
print(datetime.fromtimestamp(ts))
```
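
The individual calls above can be combined into a small workflow: parse strings, sort them, and measure elapsed time. The sketch below uses only the standard library; the timestamp strings in `raw_stamps` are example values, not real data.

```python
from datetime import datetime, timedelta

# Hypothetical raw timestamps in month/day/year order
raw_stamps = ["12/30/2017 15:19:13", "01/15/2017 08:00:00", "06/21/2017 23:45:10"]

# Parse each string into a datetime, then sort chronologically
parsed = sorted(datetime.strptime(s, "%m/%d/%Y %H:%M:%S") for s in raw_stamps)

earliest, latest = parsed[0], parsed[-1]
elapsed = latest - earliest  # subtracting datetimes gives a timedelta

print(earliest.isoformat())
print(elapsed.days, "full days elapsed")
print(elapsed.total_seconds(), "seconds elapsed")

# Shift a datetime by a fixed interval
reminder = earliest + timedelta(days=29)
print(reminder.strftime("%Y-%m-%d"))
```

Sorting works directly on `datetime` objects, whereas sorting the raw strings in this month-first format would order them by month text rather than by actual time.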

### Correlation

```
# Coefficient and p-value
from scipy import stats
pearson_coef, p_val = stats.pearsonr(df["col1"], df["col2"])
# Visualize correlation
import matplotlib.pyplot as plt
import seaborn as sns
sns.lmplot(x="col1", y="col2", data=df, ci=None)
plt.show()

# Find pairwise correlations among all numeric columns in a dataframe
correlations = df.corr(numeric_only=True)
# Use a heatmap to visualize the correlation matrix (pairwise scatterplots become hard to read as the number of variables grows)
sns.heatmap(correlations, annot=True) # Pass cmap='coolwarm' to set the color map
# Rotate x tick labels by 45 degrees so they do not overlap
plt.xticks(rotation=45)
plt.title('Correlations')
plt.show()
```

### KDE plot

```
sns.kdeplot(data=df, x="num_col", hue="cat_col", cut=0, cumulative=False)
plt.show()
```

# Chapter 4

- Generate hypotheses based on assumed correlations
- Check whether those assumptions hold with hypothesis testing to reach conclusive evidence
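
One way to test whether an observed correlation could be due to chance is a permutation test, sketched below with NumPy only. This is a swapped-in illustration, not a method prescribed by these notes; `stats.pearsonr` from the Correlation section returns an analytical p-value for the same question. The data, the helper name `permutation_p_value`, and the number of permutations are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample: y depends weakly on x, plus noise
n = 200
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

def permutation_p_value(a, b, n_permutations=2000, rng=rng):
    """Two-sided permutation test for the Pearson correlation of a and b."""
    observed = np.corrcoef(a, b)[0, 1]
    count = 0
    for _ in range(n_permutations):
        # Shuffling b breaks any real link between a and b
        shuffled = rng.permutation(b)
        if abs(np.corrcoef(a, shuffled)[0, 1]) >= abs(observed):
            count += 1
    # The +1 terms keep the estimated p-value strictly above zero
    return observed, (count + 1) / (n_permutations + 1)

r, p = permutation_p_value(x, y)
print(f"r = {r:.3f}, p = {p:.4f}")
```

A small p-value means shuffled data rarely produces a correlation as strong as the observed one, which supports (but does not prove) a real relationship, and says nothing about causation.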