# Lab | Revisiting Machine Learning Case Study
- In this lab, you will use the learningSet.csv file, which you already cloned in today's activities.

### Instructions
Complete the following steps on the categorical columns in the dataset:

- Check for null values in all the columns

- Exclude the following variables by looking at the definitions. Create a new empty list called drop_list. We will append to this list and then drop all the columns in it later:

    - OSOURCE - symbol definitions not provided, too many categories
    - ZIP CODE - we are including state already
- Identify columns that have over 85% missing values

- Remove those columns from the dataframe

- Reduce the number of categories in the column GENDER. The column should contain only "M" for males, "F" for females, and "other" for all the rest

    - Note that there are a few null values in the column. We will first replace those null values using the code below:
        print(categorical['GENDER'].value_counts())
        categorical['GENDER'] = categorical['GENDER'].fillna('F')

## Understand The Problem

Two problems are raised:
    
    1) Donate or not donate --> Binary Classification "target_b"

    2) The amount of donation --> Linear Regression "target_d"

Goal:
- Target the high-donation group (so that the mailing stays cost-effective)

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

## Import Data

In [None]:
df = pd.read_csv("/Users/kt/Desktop/Ironhack/Data-Analytics-Ironhack/unit-7/learningSet.csv")
df.head()

## Exploratory Data Analysis (EDA)

In [None]:
# Check rows, columns
df.shape

In [None]:
# Standardize column names (lowercase, spaces replaced with underscores)
df.columns = df.columns.str.lower().str.replace(" ", "_")
# df.columns

In [None]:
# Check dtypes --> float64(97), int64(310), object(74)
df.info()

In [None]:
# Check NaN, single-space cells & duplicate rows in the data frame

def data_explore(data): # Prints the duplicate-row count; returns NaN & single-space counts per column
    dup_rows = data.duplicated().sum()
    nan = data.isna().sum()
    empty = data.eq(' ').sum()
    explore = pd.DataFrame({"NaN": nan, "EmptySpaces": empty}) # New dataframe with the results
    print(f"There are {dup_rows} duplicate rows.")
    return explore

data_explore(df)

### Replace empty spaces with NaN

In [None]:
# Many cells in the data frame are empty or whitespace-only. Replace them with NaN
df = df.replace(r'^\s*$', np.nan, regex=True)

data_explore(df)
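
To illustrate what the regex replacement does, here is a minimal sketch on a toy data frame. The column names and values are made up for the example and are not taken from learningSet.csv.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mixing real entries, empty strings and blanks
toy = pd.DataFrame({
    "gender": ["F", " ", "M", ""],
    "state": ["CA", "TX", "   ", "NY"],
    "age": [60, 45, 38, 71],
})

# ^\s*$ matches strings that are empty or contain only whitespace
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)

# Count the NaN values per column after the replacement
nan_per_column = cleaned.isna().sum()
print(nan_per_column)
```

The pattern also catches cells with several spaces, which a plain comparison against a single space would miss.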

### Explore the percentage of NaN per column

In [None]:
nulls = pd.DataFrame(df.isna().sum()*100/len(df), columns=['percentage'])
nulls.sort_values('percentage', ascending = False)

In [None]:
# Identify columns that have over 85% missing values

# Keep only the rows of `nulls` whose percentage is above 85
df_nan = nulls[nulls["percentage"] > 85]

# Get the columns' name
nan85 = df_nan.index.tolist()

# nan85
# len(nan85)

In [None]:
# Remove all those columns (nan > 85%) from the data frame

df = df.drop(nan85, axis=1)

In [None]:
# 55 columns were dropped --> 426 columns are left
df.shape
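
The thresholding steps can be sketched on a small toy frame. This is only an illustration under assumed data: the column names below are hypothetical, and `isna().mean()` is used as a shortcut for the missing fraction per column.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): "mostly_missing" is 90% NaN, "complete" has none
toy = pd.DataFrame({
    "mostly_missing": [1.0] + [np.nan] * 9,
    "half_missing": [1.0, np.nan] * 5,
    "complete": range(10),
})

# isna().mean() gives the fraction of missing values per column
missing_pct = toy.isna().mean() * 100

# Names of the columns above the 85% threshold
to_drop = missing_pct[missing_pct > 85].index.tolist()

reduced = toy.drop(columns=to_drop)
print(to_drop, reduced.shape)
```

Only the column above the threshold is removed; the half-empty column stays and would need its own imputation strategy.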

## Explore Target
### Check Target_B (Donate vs Not Donate)
0 - Did not donate

1 - Donated

In [None]:
print(df.target_b.value_counts())

sns.countplot(x=df["target_b"])

### Check Target_D (The amount of donation)

There is a huge gap between the two groups: donors make up about 5% of the data, whereas around 95% are non-donors.

Since we are interested in the donor group, we will look at how much the donors gave.
Because each mailing has a cost (0.68 per mail), we want to focus on the high-donation group.
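
The class shares can be read directly with `value_counts(normalize=True)`. The sketch below uses a toy target with a made-up 1-in-20 donor ratio, chosen only to mirror the rough 5% / 95% split described above.

```python
import pandas as pd

# Toy target (hypothetical): 1 donor out of 20 recipients
target = pd.Series([1] + [0] * 19, name="target_b")

# normalize=True returns proportions instead of raw counts
shares = target.value_counts(normalize=True) * 100
print(shares)
```

On the real data, the same call on `df["target_b"]` would give the exact percentages instead of the approximate ones quoted here.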

In [None]:
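# Visualize the donor group
# df_donate is not defined earlier in the notebook; assumed here to be
# the donation amounts (target_d) of the rows where target_b == 1
df_donate = df.loc[df["target_b"] == 1, "target_d"]
sns.distplot(df_donate)
print(df_donate.describe(), "\n")
# print(df_donate.value_counts())
plt.xlabel("Amount (dollars)")
plt.ylabel("Density")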

- From the plot and descriptive stats, we see that the average donation is about 15 dollars and the maximum is 200.
- Only a small number of donors give large amounts.
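
To see why the mailing cost matters, here is a rough back-of-the-envelope calculation. It uses only the approximate figures quoted in this notebook (about 5% donors, about 15 dollars per donation, 0.68 per mail), so the result is an estimate rather than an exact dataset statistic.

```python
# Rough figures quoted in this notebook; approximations, not exact statistics
response_rate = 0.05      # about 5% of recipients donate
avg_donation = 15.0       # average gift in dollars among donors
cost_per_mail = 0.68      # mailing cost in dollars

# Expected revenue and profit per mail if everyone is mailed
expected_revenue = response_rate * avg_donation
expected_profit = expected_revenue - cost_per_mail

# Smallest average gift at which mailing everyone breaks even
break_even_donation = cost_per_mail / response_rate
print(round(expected_profit, 2), round(break_even_donation, 2))
```

With these numbers, mailing everyone yields only about 0.07 dollars per mail, and the break-even average gift is about 13.6 dollars, which is why targeting the higher-donation group is attractive.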

In [None]:
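# Following the lab instructions --> select the categorical variables for this lab
# .copy() gives an independent frame, so later column assignments do not trigger SettingWithCopyWarning
df_cat = df.select_dtypes(object).copy()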
df_cat.head()

### Lab instruction

Exclude the following variables by looking at the definitions. Create a new empty list called drop_list. We will append to this list and then drop all the columns in it later:

    - OSOURCE - symbol definitions not provided, too many categories
    - ZIP CODE - we are including state already

In [None]:
drop_list = ["osource", "zip_code"]
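
The "append now, drop later" pattern from the instructions could look like the sketch below. The toy frame and the extra name `another_col` are hypothetical; `errors="ignore"` is used so that names missing from the frame do not raise an error.

```python
import pandas as pd

# Toy frame (hypothetical columns standing in for the categorical subset)
toy_cat = pd.DataFrame({
    "osource": ["GRI", "BOA"],
    "state": ["IL", "CA"],
    "gender": ["F", "M"],
})

drop_list = ["osource", "zip_code"]
drop_list.append("another_col")  # hypothetical later addition

# errors="ignore" skips names that are not present in the frame
toy_cat = toy_cat.drop(columns=drop_list, errors="ignore")
print(toy_cat.columns.tolist())
```

Without `errors="ignore"`, a name in `drop_list` that does not match a real column would raise a `KeyError`, so it is worth checking the names against `df_cat.columns` before the final drop.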

### Reduce the number of categories in the column GENDER.

In [None]:
# Check values in GENDER
df_cat["gender"].value_counts()

In [None]:
# Check NaN
# df_cat["gender"].isna().sum()

In [None]:
# Fill NaN with "U" - Unknown
df_cat["gender"] = df_cat["gender"].fillna("U")

# df_cat["gender"].value_counts()

In [None]:
df_cat["gender"] = df_cat["gender"].apply(lambda x: "F" if x == "F" else ("M" if x == "M" else "other"))
df_cat["gender"].value_counts()
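
As an alternative to the nested lambda, the same reduction can be written with `isin` and `where`. This is a sketch on a toy series with made-up codes, not the notebook's actual column.

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical codes) including a missing value
gender = pd.Series(["F", "M", "U", "J", np.nan, "F"])

# Keep "M" and "F"; everything else, including NaN, becomes "other"
reduced = gender.where(gender.isin(["M", "F"]), "other")
counts = reduced.value_counts()
print(counts)
```

Because `isin` returns False for NaN, missing values fall into "other" without a separate `fillna` step.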