## 3.4.3 Understanding Data Statistics for Pre-processing
Summary statistics such as the mean, minimum, maximum, mode, and skewness describe how an attribute's values are distributed across the range they can take.


### Exploring Numerical Attributes' Data
To explore attributes with numerical data types, we can use the `describe()` function to observe the basic statistics of all numerical attributes.

We can also use statistical functions such as `skew()` to observe the skewness of specific attributes. Returning to the merged dataset, `CustomerChurn`, we can use the following code to explore its statistics:

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd

df = pd.read_csv('data/CustomerChurn.csv')

# show basic statistics for numerical columns (count, mean, std, min, max, 25%, 50%, 75%)
print(df.describe())   

# drop the non-numerical columns before computing skewness
df1 = df.drop(['Gender', 'Churn'], axis=1)

# show the skewness of each remaining column
print('-----------Skewness--------------')
print(df1.skew())

       Unnamed: 0   CustomerId         Age  PostalCode  MinTrxValue  \
count  996.000000   996.000000  996.000000  996.000000   996.000000   
mean   499.056225   500.228916   45.487952    4.616466    85.809498   
std    288.749769   288.614980   18.883213    1.692873   162.920287   
min      0.000000     1.000000    0.000000    1.000000     0.000000   
25%    248.750000   250.750000   29.750000    3.000000     0.297500   
50%    499.500000   499.500000   44.000000    5.000000     5.940000   
75%    748.250000   750.250000   58.000000    6.000000    79.940000   
max    999.000000  1000.000000   91.000000    9.000000   987.520000   

       MaxTrxValue  TotalTrxValue        Cash  CreditCard      Cheque  \
count   996.000000     996.000000  996.000000  996.000000  996.000000   
mean    242.864639     336.115994    0.424699    0.722892    0.269076   
std     173.963382     215.235056    0.494546    0.447795    0.443703   
min       0.880000       2.480000    0.000000    0.000000    0.00000

Generally, the skewness of a normal distribution is zero, and any symmetric data should have a skewness near zero. However, based on the `skew()` results, the `MinTrxValue`, `MaxTrxValue`, `TotalTrxValue`, `CreditCard`, and `Cheque` attributes have skewed distributions: their skewness values are not close to zero, even when rounded to the nearest whole number.
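
To make the "near zero" rule concrete, the following minimal sketch compares the skewness of a symmetric sample with that of a right-skewed sample. Both series are made-up illustrative values, not taken from `CustomerChurn`.

```python
import pandas as pd

# hypothetical values for illustration only (not from CustomerChurn)
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 50])

# symmetric data: the skewness is zero
print(symmetric.skew())

# one very large value stretches the right tail: the skewness is clearly positive
print(right_skewed.skew())
```

A positive value indicates a longer right tail, and a negative value indicates a longer left tail.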

Many more statistical functions are available in Python, and students can try them out as additional learning activities.
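
As a possible starting point for such activities, the sketch below applies a few other pandas statistics to a small made-up series (the values are hypothetical). `median()`, `mode()`, and `kurt()` are standard pandas `Series` methods.

```python
import pandas as pd

# hypothetical values for illustration only
values = pd.Series([1, 2, 2, 3, 10])

print(values.median())  # middle value of the sorted data
print(values.mode())    # most frequent value(s), returned as a Series
print(values.kurt())    # excess kurtosis (a normal distribution scores about 0)
```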

As described in the metadata in Week 2, the `Cash`, `CreditCard`, and `Cheque` attributes take the value 1 if a customer has paid by cash, credit card, or cheque, respectively, and 0 otherwise. Since they contain only two values, we ignore their skewness findings, but we explore them further to confirm that their values are either 0 or 1.

In the next section (Section 4.4), we will use the above skewness findings to further explore the `MinTrxValue`, `MaxTrxValue`, `TotalTrxValue`, `CreditCard`, and `Cheque` attributes and detect whether outliers exist.

### Exploring Categorical and Binary Attributes' Data
The pandas function `value_counts()` shows the count of each value of an attribute. We can also use `value_counts()` to check whether an attribute's data fall within the expected values or range.

To explore categorical attributes, such as `Gender`, `PostalCode`, and `Churn`, we can use the following code:

In [2]:
# show the values that exist in each attribute and their counts
print(df['Gender'].value_counts())
print(df['Churn'].value_counts())
print(df['PostalCode'].value_counts())

# further exploration to check values
print(df['Cash'].value_counts())
print(df['Cheque'].value_counts())
print(df['CreditCard'].value_counts())

Gender
male        496
female      400
mänlich      51
weiblich     49
Name: count, dtype: int64
Churn
yes    502
no     494
Name: count, dtype: int64
PostalCode
4    242
5    205
6    162
3    153
2     79
7     78
8     39
1     23
9     15
Name: count, dtype: int64
Cash
0    573
1    423
Name: count, dtype: int64
Cheque
0    728
1    268
Name: count, dtype: int64
CreditCard
1    720
0    276
Name: count, dtype: int64


After running the above code, we can observe that the `Gender` attribute contains four values: `male` (496 rows), `female` (400 rows), `mänlich` (51 rows), and `weiblich` (49 rows). Since the original datasets were collected in Germany, some rows use the German terms `mänlich` and `weiblich`, which refer to male and female, respectively. Because they have the same meaning as the English values, we should fix this misclassification problem.

The `Churn`, `PostalCode`, `Cash`, `Cheque`, and `CreditCard` attributes contain data that fall within the expected values. We expect the `PostalCode` values to fall within 1 to 9, `Churn` to be `yes` or `no`, and `Cash`, `CreditCard`, and `Cheque` to be 0 or 1. Therefore, no further pre-processing is necessary for them.
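
The same range checks can also be expressed programmatically rather than read off the `value_counts()` output. The following is a minimal sketch using `isin()` on a small made-up DataFrame. The sample rows and the `expected` dictionary are assumptions for illustration, with the allowed values taken from the expectations stated above.

```python
import pandas as pd

# hypothetical sample rows for illustration only (not the real CustomerChurn data)
sample = pd.DataFrame({
    'Churn': ['yes', 'no', 'yes'],
    'PostalCode': [4, 9, 1],
    'Cash': [0, 1, 1],
    'Cheque': [0, 0, 2],  # 2 is deliberately invalid
})

# expected values for each attribute
expected = {
    'Churn': ['yes', 'no'],
    'PostalCode': list(range(1, 10)),
    'Cash': [0, 1],
    'Cheque': [0, 1],
}

# count the rows whose value falls outside the expected set
unexpected_counts = {
    column: int((~sample[column].isin(allowed)).sum())
    for column, allowed in expected.items()
}
print(unexpected_counts)
```

A non-zero count flags an attribute that needs a closer look before modelling.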


### Treating Misclassification
To tackle the misclassification problem in the `Gender` attribute, we replace `mänlich` with `male` and `weiblich` with `female` using the following Python code:

In [3]:
# to locate mänlich in Gender attribute and replace it with male
df.loc[df['Gender'] == 'mänlich', 'Gender'] = 'male' 

# to locate weiblich in Gender attribute and replace it with female
df.loc[df['Gender'] == 'weiblich', 'Gender'] = 'female'

To confirm the misclassification has been fixed, we observe the Gender attribute data again:

In [4]:
# to confirm the Gender attribute values are correct
print(df['Gender'].value_counts())

Gender
male      547
female    449
Name: count, dtype: int64
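
As an alternative to two separate `.loc` assignments, the same fix could be written as a single `replace()` call with a mapping. This is a sketch on a small made-up series, not the approach used above, and the `gender_map` name is hypothetical.

```python
import pandas as pd

# hypothetical sample values for illustration only
gender = pd.Series(['male', 'mänlich', 'female', 'weiblich', 'male'])

# map each German label to its English equivalent; other values stay unchanged
gender_map = {'mänlich': 'male', 'weiblich': 'female'}
gender = gender.replace(gender_map)

print(gender.value_counts())
```

A mapping keeps all label corrections in one place, which helps if more variants turn up later.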


### Saving the Data after Pre-processing
After treating the misclassification problem in the `Gender` attribute, we save the data as `ChurnProcessed.csv`.

In [5]:
# save the data into a CSV file 
df.to_csv('data/ChurnProcessed.csv')
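
Note that `to_csv()` writes the DataFrame index as an extra first column by default. When such a file is read back with `read_csv()`, that column appears as `Unnamed: 0`, which is likely how the `Unnamed: 0` column in the earlier `describe()` output arose. A minimal sketch of the difference, using an in-memory buffer and made-up rows instead of the real file:

```python
import io
import pandas as pd

# hypothetical rows for illustration only
sample = pd.DataFrame({'Gender': ['male', 'female'], 'Churn': ['yes', 'no']})

# default: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(sample.to_csv()))
print(with_index.columns.tolist())

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(sample.to_csv(index=False)))
print(without_index.columns.tolist())
```

Whether to pass `index=False` when saving `ChurnProcessed.csv` depends on whether later steps rely on that extra column.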