<img src="https://www.mmu.edu.my/fci/wp-content/uploads/2021/01/FCI_wNEW_MMU_LOGO.png" style="height: 80px;" align=left>  

# Learning Objectives

By the end of this lesson, you should be able to:
- perform a preliminary investigation of a dataset



---



### For Google Colab Use Only
Skip this section if you are using Jupyter Notebook etc.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
drive_path = '/content/drive/MyDrive/Trimester/2310/TDS3301/Tutorials/Tutorial 2/' #set your google drive path

---

In [None]:
import pandas as pd
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
%matplotlib inline

# Read data

In [None]:
try:
  drive_path
except NameError:
  drive_path = ''

df = pd.read_csv(drive_path + "BigMartSales/train.csv")

In [None]:
# check dimensionality
print("Number of rows:", df.shape[0])
print("Number of features/columns:", df.shape[1])

In [None]:
df.tail(5)

## Checking data types (attribute types)
**It's important to check that the data types are correct. Sometimes a numeric column is treated as an object type because junk text is mixed into the data.**

In [None]:
df.dtypes # df.info() works too

**If that is the case, you can manually convert them to the proper data type**

In [None]:
# Example: force columns to a numeric type.
# errors="coerce" converts any value that cannot be parsed as a number into NaN.
df["Item_Weight"] = pd.to_numeric(df["Item_Weight"], errors="coerce")
df["Outlet_Establishment_Year"] = pd.to_numeric(df["Outlet_Establishment_Year"], errors="coerce")
# downcast="float" then stores the column in the smallest float dtype that can hold its values
df["Outlet_Establishment_Year"] = pd.to_numeric(df["Outlet_Establishment_Year"], downcast="float")

# If your data has a datetime column, use pd.to_datetime() to convert it to the proper type.

In [None]:
df.dtypes
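
To see what `errors="coerce"` does in isolation, here is a minimal sketch on a made-up column (the values are invented for illustration and are not taken from the BigMart data). It also shows the same idea with `pd.to_datetime()`.

```python
import pandas as pd

# Toy column (made up for illustration): numbers stored as text, with junk mixed in
raw = pd.Series(["9.3", "5.92", "unknown", "17.5", "--"])

# errors="coerce" turns every value that cannot be parsed as a number into NaN
weights = pd.to_numeric(raw, errors="coerce")
print(weights.dtype)         # float64
print(weights.isna().sum())  # 2 junk entries became NaN

# The same idea works for dates: unparseable entries become NaT
dates = pd.to_datetime(pd.Series(["2009-01-15", "not a date"]), errors="coerce")
print(dates)
```

Checking `isna().sum()` right after the conversion is a quick way to find out how many junk values a column contained.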

# Missing data
**To check if there is any missing data:**

In [None]:
# Change the .sum() to .mean()*100 if you prefer it in %
# You can even plot it out if you want, might be useful if there are too many features.
df.isna().sum()

In [None]:
# Number of rows that contain at least one missing value
len(df[df.isna().any(axis=1)])
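
The percentage variant mentioned above can be sketched as follows. The frame `toy` and its values are made up for illustration; with the real data you would call the same methods on `df`.

```python
import numpy as np
import pandas as pd

# Toy frame (values made up for illustration)
toy = pd.DataFrame({
    "weight": [9.3, np.nan, 17.5, np.nan],
    "size": ["Small", None, "High", "Medium"],
    "sales": [3735.1, 443.4, 2097.3, 732.4],
})

# Share of missing values per column, in percent, largest first
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)

# Number of rows with at least one missing value
rows_with_na = toy.isna().any(axis=1).sum()
print(rows_with_na)

# Optional: missing_pct.plot(kind="bar") gives a quick visual overview
```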

In [None]:
# At this point, you can either drop the rows containing NA or impute them (https://scikit-learn.org/stable/modules/impute.html).
# There are different types of missing data (missing at random, missing completely at random, etc.),
# so deal with NA accordingly.

#df = df.dropna() # drop every row that contains an NA
#df = df.fillna(df.median(numeric_only=True)) # fill NA in numeric features with each feature's median
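
As one possible imputation strategy (an assumption for illustration, not the only valid choice), the sketch below fills a numeric column with its median and a categorical column with its most frequent value. The frame `toy` is made up.

```python
import numpy as np
import pandas as pd

# Toy frame (values made up for illustration)
toy = pd.DataFrame({
    "weight": [9.0, np.nan, 17.0, 12.0],
    "size": ["Small", None, "Small", "Medium"],
})

filled = toy.copy()
# Numeric column: fill with the median, which is less sensitive to outliers than the mean
filled["weight"] = filled["weight"].fillna(filled["weight"].median())
# Categorical column: fill with the most frequent value (the mode)
filled["size"] = filled["size"].fillna(filled["size"].mode()[0])

print(filled)
```

Whether this is appropriate depends on why the values are missing, so treat it as a starting point rather than a rule.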

# Removing duplicated Data

In [None]:
print("Total duplicated rows: ", df.duplicated().sum())

# drop duplicates
df = df.drop_duplicates() # or df = df[~df.duplicated()]

# Measuring Central Tendency (Mean, median, mode)

In [None]:
# All in one except "mode"; also includes the quartiles, standard deviation, min and max.
df.describe() # by default only returns numeric columns; pass include="all" to include all dtypes

In [None]:
# To get the mean, median and mode of a feature, you can use pandas .mean() .median() or .mode() function
# Example:

print("Mean: ", df["Item_Outlet_Sales"].mean())
print("Median: ", df["Item_Outlet_Sales"].median())
print("Mode: ", df["Item_Outlet_Sales"].mode().tolist()) # mode might return more than 1 value, eg pd.Series([1,1,2,2,3,3]) returns 1,2,3

In [None]:
# These measures can also be useful in descriptive analytics, for instance

# Get the mean/average sales in 2009 by item_type
filtered = df[df["Outlet_Establishment_Year"] == 2009]
display(filtered.groupby(["Item_Type"]).agg({"Item_Outlet_Sales": "mean"}))

# Mode can be used to extract the most frequent data
# Or you can use df[col].value_counts() and the first item is the mode
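
A minimal sketch of the two ways to get the mode, on a made-up series of item types (the values are invented for illustration):

```python
import pandas as pd

# Toy categorical data (made up for illustration)
items = pd.Series(["Dairy", "Snacks", "Dairy", "Meat", "Snacks", "Dairy"])

# value_counts() sorts by frequency, so the first index entry is the mode
counts = items.value_counts()
print(counts.index[0])        # Dairy
print(items.mode().tolist())  # ['Dairy']

# When several values tie for the highest count, mode() returns all of them
tied = pd.Series([1, 1, 2, 2, 3, 3])
print(tied.mode().tolist())   # [1, 2, 3]
```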

# Visualizing distribution of data
**Simple plots can be made quickly with pandas `.plot()`. Seaborn is a good alternative.**

## Histogram

In [None]:
df["Item_Outlet_Sales"].plot(kind="hist", bins=10)

## Histogram + density plot

In [None]:
ax = df["Item_Outlet_Sales"].plot(kind="hist")
df["Item_Outlet_Sales"].plot(kind="kde", ax=ax, secondary_y=True)

In [None]:
# Similar output, but using seaborn
sns.displot(df["Item_Outlet_Sales"], bins=10, kde=True)

### We can quantify skewness by:

In [None]:
print("Skewness: ", df["Item_Outlet_Sales"].skew())
# A positive value here shows that this variable is skewed to the right (positive skewness)

### Transformation to reduce skewness
- Common transformations to reduce positive skewness are log, square root, and cube root.
- If the data is negatively skewed, use a square or cube transformation instead, or reflect the data first and then apply a log or root transformation.

In [None]:
sqrt_transform = np.sqrt(df["Item_Outlet_Sales"])
sns.displot(sqrt_transform, bins=10, kde=True)
print("Skewness: ", sqrt_transform.skew())
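
To compare how strongly each transformation reduces positive skewness, here is a sketch on synthetic right-skewed data. The log-normal sample and the seed are assumptions for illustration, not the BigMart sales column.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (log-normal), strictly positive so log is defined
rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=0.0, sigma=0.8, size=5000))

# Skewness before and after each transformation
skews = {
    "raw": x.skew(),
    "sqrt": np.sqrt(x).skew(),
    "cbrt": np.cbrt(x).skew(),
    "log": np.log(x).skew(),
}
for name, value in skews.items():
    print(f"{name:>5}: {value: .3f}")
```

On this kind of data the square root is the mildest correction and the log is the strongest. If a column contains zeros, `np.log1p` is a common substitute for `np.log`.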

## Q-Q Plot / Normality test

In [None]:
import statsmodels.api as sm
from scipy import stats
from scipy.stats import normaltest # https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html

In [None]:
test = np.random.normal(0,1, 1000) # generate random data

sns.displot(test, kde=True)

#### QQ plot to visualize normality

In [None]:
_ = sm.qqplot(test, line='45')

#### Alternatively, a normality test

In [None]:
# Running a normality test is more efficient than drawing a Q-Q plot for every feature, especially when your data has a lot of features.
k2, p = normaltest(test)
alpha = 1e-3
print("p = {:g}".format(p)) # a large p-value means there is no evidence against normality

if p < alpha:  # null hypothesis: x comes from a normal distribution
    print("The null hypothesis can be rejected")
else:
    print("The null hypothesis cannot be rejected")
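
Applying the test to every numeric feature at once could look like the sketch below. The frame `toy`, its column names, and the seed are made up for illustration; with the real data you would loop over the numeric columns of `df` in the same way.

```python
import numpy as np
import pandas as pd
from scipy.stats import normaltest

# Synthetic frame (made up for illustration): one roughly normal column,
# one clearly right-skewed column, and one non-numeric column
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "normal_like": rng.normal(0, 1, 1000),
    "right_skewed": rng.exponential(1.0, 1000),
    "label": ["a", "b"] * 500,
})

alpha = 1e-3
results = {}
# Only numeric columns can be tested; drop NaN because normaltest does not ignore it by default
for col in toy.select_dtypes(include="number").columns:
    stat, p = normaltest(toy[col].dropna())
    results[col] = p
    verdict = "reject normality" if p < alpha else "cannot reject normality"
    print(f"{col}: p = {p:g} -> {verdict}")
```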

## Bar plot

In [None]:
df["Item_Type"].value_counts().plot(kind="bar")
# using groupby works too: df.groupby("Item_Type").size().sort_values(ascending=False).plot(kind="bar")

## Box Plot

In [None]:
# Boxplot on a numeric feature
df[["Item_Outlet_Sales"]].plot(kind="box")
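
The points drawn individually beyond the whiskers follow the 1.5 × IQR rule (the matplotlib default). The same bounds can be computed directly, as in this sketch on made-up numbers:

```python
import pandas as pd

# Toy values (made up for illustration), with one obvious extreme value
sales = pd.Series([10, 12, 13, 14, 15, 16, 18, 20, 95])

# Quartiles and interquartile range
q1, q3 = sales.quantile([0.25, 0.75])
iqr = q3 - q1

# Values outside these bounds are drawn as individual points on the box plot
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = sales[(sales < lower) | (sales > upper)]

print(lower, upper)       # 5.5 25.5
print(outliers.tolist())  # [95]
```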

In [None]:
# box plot to show distributions with respect to categories

fig, ax = plt.subplots(figsize=(11.7, 8.27))
sns.boxplot(data=df, x='Outlet_Type', y='Item_Outlet_Sales', ax=ax)

## Correlation plot

In [None]:
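# Correlation is only defined for numeric columns, so select them first
# (recent pandas versions raise an error if text columns are passed to .corr())
corr = df.select_dtypes(include="number").corr()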

plt.figure(figsize=(7,7))
sns.heatmap(corr, vmax=.8, square=True, annot=True, fmt= '.2f',
            annot_kws={'size': 15}, cmap=sns.color_palette("Blues"))

## Scatterplot
#### * We'll use the iris dataset here

In [None]:
iris = datasets.load_iris()

In [None]:
df2 = pd.DataFrame(data=iris["data"], columns=iris["feature_names"])
df2["target"] = [iris["target_names"][i] for i in iris["target"]]

In [None]:
sns.scatterplot(x="sepal length (cm)", y="petal length (cm)", data=df2)

In [None]:
sns.scatterplot(x="sepal length (cm)", y="sepal width (cm)", data=df2)

In [None]:
sns.scatterplot(x="petal length (cm)", y="sepal width (cm)", data=df2)

In [None]:
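# "target" holds text labels, so keep only the numeric columns for the correlation
corr = df2.select_dtypes(include="number").corr()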

plt.figure(figsize=(7,7))
sns.heatmap(corr, vmax=.8, square=True, annot=True, fmt= '.2f', annot_kws={'size': 15}, cmap=sns.color_palette("Reds"))