## 3.4.1 Discussion Forum Activity: Understanding Data Characteristics

In this course, we will learn how to explore data characteristics using Python on the Ed platform. We will practise the exploration methods on the customer churn datasets. There are three raw datasets, as described in the metadata presented in Week 2:

- Customer: stored in the `customer.xlsx` file under the `/data/` folder.
- Churn: stored in the `churn.xlsx` file under the `/data/` folder.
- Payment transaction: stored in the `ptransaction.xlsx` file under the `/data/` folder.

We covered these characteristics in the previous section; now we will use Python code to implement the data exploration methods. In a Python environment, we must first load the datasets before we can explore them. The following code gives an example of loading each dataset into a `DataFrame` using the `pandas` library.

In [2]:
import pandas as pd

df1 = pd.read_excel('data/customer.xlsx')
df2 = pd.read_excel('data/churn.xlsx')
df3 = pd.read_excel('data/ptransaction.xlsx')

This course focuses on the concepts of the data mining process and uses a case study to apply and implement them in Python. Explaining Python programming concepts is beyond the scope of this course. Python programming knowledge and skills are a prerequisite, specifically for implementing the concepts.

### Dataset Dimension

To understand a dataset's dimension, we explore the data to observe the number of instances (a.k.a. records or rows) and attributes (a.k.a. columns or features). Add the following Python code after the existing lines in the `main.py` file (you can always leave a few empty lines before starting the new code).

In [3]:
# display each dataset's dimension as (rows, columns)
print("Customer dataset: ", df1.shape)
print("Churn dataset: ", df2.shape)
print("Payment transaction dataset: ", df3.shape)

Customer dataset:  (1000, 6)
Churn dataset:  (504, 1)
Payment transaction dataset:  (998, 9)


The `shape` attribute (accessed without parentheses, since it is not a function) holds the number of rows (a.k.a. records or samples) and columns (a.k.a. data fields or attributes) of the respective dataset as a `(rows, columns)` tuple.

After running the program, we should observe, for example, that the Customer dataset contains 1000 records and six attributes, while the Churn dataset has only one attribute with 504 records.
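Because `shape` is a plain tuple, it can also be unpacked into separate row and column counts. The following is a minimal sketch using a small made-up `DataFrame` (the name `toy` and its values are illustrative assumptions, not the course data):

```python
import pandas as pd

# Toy stand-in for a dataset (hypothetical values, not the course data)
toy = pd.DataFrame({
    "CustomerId": [1, 2, 3],
    "PostalCode": [3000, 3053, 3121],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = toy.shape
print(f"{n_rows} records, {n_cols} attributes")
```

The same unpacking can be applied to `df1`, `df2`, and `df3` once they are loaded.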

### Data Domain

To get an attribute's data type and check whether the type and its values match what we expect from the domain knowledge, we can use the `info()` function as follows:

In [4]:
# to display dataset column names, #non-null, data type
print(df1.info())
print(df2.info())
print(df3.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   CustomerId  1000 non-null   int64         
 1   Firstname   1000 non-null   object        
 2   Gender      999 non-null    object        
 3   PostalCode  1000 non-null   int64         
 4   HashCode    1000 non-null   object        
 5   Birthdate   999 non-null    datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(3)
memory usage: 47.0+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 504 entries, 0 to 503
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   CustomerId  504 non-null    int64
dtypes: int64(1)
memory usage: 4.1 KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 998 entries, 0 to 997
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype         
---  ------      

To understand the results, take the Customer dataset as an example. It contains six attributes (i.e., `CustomerId`, `Firstname`, `Gender`, `PostalCode`, `HashCode`, `Birthdate`). None of the attributes contains `NULL` values (indicated by `1000 non-null`) except `Gender` and `Birthdate`, which each have one record with a `NULL` value; that is, only `999` records have a value for each of them. We do not know why these `NULL` values occur; they may be due to an error, or the information may not have been available at the time of data entry. We will study possible methods for dealing with `NULL` values in the next section, Data Pre-processing Methods.
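As a complement to reading the non-null counts from `info()`, missing values can also be counted directly. The sketch below uses a small made-up `DataFrame` (the name `toy` and its values are illustrative assumptions, not the course data) and the `pandas` methods `isna()`, `sum()`, and `any()`:

```python
import pandas as pd

# Toy stand-in for the Customer dataset (hypothetical values)
toy = pd.DataFrame({
    "CustomerId": [101, 102, 103, 104],
    "Gender": ["F", None, "M", "F"],
    "Birthdate": pd.to_datetime(["1990-01-15", "1985-06-30", None, "2000-12-01"]),
})

# Count the missing (NULL) values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Select the rows that have at least one missing value
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)
```

Applied to `df1`, a count like this would be expected to agree with the `Non-Null Count` column reported by `info()`.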

The `info()` function also reports data type information. For example, the `Customer` dataset has the following attribute data types:

```
...[results above are omitted here]...
dtypes: datetime64[ns](1), int64(2), object(3)
...[results below are omitted here]...
```

The result above shows one attribute (`Birthdate`) of type `datetime64[ns]`, two (`CustomerId` and `PostalCode`) of type `int64` (i.e., integers), and three (`Firstname`, `Gender`, and `HashCode`) of type `object`. If we compare these with the metadata of the datasets described in the Week 2 topic on data understanding based on domain knowledge, we discover that some data types do not match. The following table shows the matched and unmatched data types according to domain knowledge:



![problem data understanding](attachment:learning_materials/DM - problem data understanding.png)

Observe the attributes of the `Churn` and `Payment transaction` datasets and their data types. Referring to the metadata presented in Week 2, use the example code given earlier to identify the unmatched data types, as presented in Figure 2.
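One possible way to organise this comparison is to put the actual data types next to the expected ones. The sketch below is only an illustration: the `toy` data and the `expected` mapping are hypothetical placeholders, and the expected types should be taken from the Week 2 metadata rather than from this example.

```python
import pandas as pd

# Toy stand-in for a dataset (hypothetical values, not the course data)
toy = pd.DataFrame({
    "CustomerId": [101, 102],
    "PostalCode": [3000, 3053],
})

# Hypothetical expectations; replace with the types from the Week 2 metadata
expected = {"CustomerId": "int64", "PostalCode": "object"}

# Put actual and expected types side by side and flag mismatches
actual = toy.dtypes.astype(str)
report = pd.DataFrame({"actual": actual, "expected": pd.Series(expected)})
report["match"] = report["actual"] == report["expected"]
print(report)
```

Rows where `match` is `False` are the candidates to list as unmatched data types.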