# 4.1 Data Exploration 

Note: The Credit Card Fraud dataset was downloaded from https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud?resource=download

## Basic Exploration

1. Import the `pandas` library
2. Load the dataset into a DataFrame

In [None]:
import pandas as pd

df = pd.read_csv('../datasets/creditcard.csv')
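For a large file it can help to preview only a few rows before loading everything. The sketch below uses the `nrows` parameter of `pd.read_csv` on a small in-memory CSV. The `csv_text` content is made up for illustration and only mimics a few of the dataset's columns.

```python
import io

import pandas as pd

# Made-up CSV text standing in for creditcard.csv (illustrative values only).
csv_text = """Time,V1,Amount,Class
0,-1.36,149.62,0
0,1.19,2.69,0
1,-1.36,378.66,0
1,-0.97,123.50,0
"""

# Read only the first two data rows to preview the structure.
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(preview)
```

With the real file, the same `nrows` argument can be passed alongside the file path.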

3. Check the number of rows and columns in the loaded data

Output: (number of rows, number of columns)

In [None]:
df.shape
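Because `.shape` returns a tuple, it can be unpacked into two variables. The sketch below shows this on a toy DataFrame. The name `toy` and its values are made up for illustration and are not taken from the credit card dataset.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0],
    "Amount": [149.62, 2.69, 378.66],
    "Class": [0, 0, 1],
})

# .shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```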

4. Print information about the DataFrame, including the index dtype, the column dtypes, the non-null counts and the memory usage.

Every column is listed as `float64`, meaning it holds floating-point numbers such as -0.4, 1.29, 0.0, 1.5 and 3.5. The exception is the `Class` column, which contains integer values that indicate whether a transaction is fraudulent.

In [None]:
df.info()
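The data types can also be checked programmatically instead of reading them off the `info()` report. The following is a minimal sketch on a toy DataFrame (made-up values, only a few columns) that counts columns per dtype and lists the columns that are not `float64`.

```python
import pandas as pd

# Toy data with the same mix of dtypes as described above (values are made up).
toy = pd.DataFrame({
    "Time": [0.0, 1.0],
    "V1": [-1.36, 1.19],
    "Amount": [149.62, 2.69],
    "Class": [0, 0],
})

# How many columns there are of each dtype.
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)

# Names of the columns that are not float64.
non_float_cols = toy.select_dtypes(exclude="float64").columns.tolist()
print(non_float_cols)
```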

5. Print the first few rows of the DataFrame

In [None]:
df.head()

5.1 Print more rows by passing the number of rows to `head()`

In [None]:
df.head(10)

6. Print the last few rows

In [None]:
df.tail()

7. Print the column names of the DataFrame

In [None]:
df.columns

## Further Exploration

1. View the first few rows of specific columns

In [None]:
df[['Class', 'Time', 'Amount']].head()

2. View the last few rows of the same columns

In [None]:
df[['Class', 'Time', 'Amount']].tail()

3. Count the unique values of the `Class` column

The `Class` column, which indicates whether a transaction is fraudulent, should contain only two possible values (0 and 1), as we were informed by the client. It is worth verifying this.

Count the number of unique values by calling the `.nunique()` method on the `Class` Series (column).


In [None]:
df['Class'].nunique()

4. Check what these unique values are

Now that we know the `Class` column has two unique values, we can check what they are.

The output `[0 1]` shows that the two unique values are 0 and 1.

In [None]:
df['Class'].unique()

5. Find and store value counts

We can also use the `.value_counts()` method on the `Class` Series to see each value of `Class` and how often it occurs.

Store the value counts in a variable named `class_counts`, then display the stored values.

In [None]:
class_counts = df['Class'].value_counts() 
print(class_counts)

The previous output shows that every value in `Class` is one of those reported by the client, and that no value falls outside the data dictionary.
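This check against the data dictionary can also be written explicitly. The sketch below uses `isin()` to flag values outside an allowed set, and `value_counts(normalize=True)` to express the counts as proportions. The `toy_class` Series is made up, and the value 2 is deliberately invalid so that the check has something to find. The `allowed_values` list is an assumption based on the two values described above.

```python
import pandas as pd

# Made-up labels; the value 2 is deliberately invalid for illustration.
toy_class = pd.Series([0, 0, 1, 0, 2, 0], name="Class")

# Assumed data dictionary: only 0 and 1 are valid.
allowed_values = [0, 1]

# Keep only the entries that are NOT in the allowed set.
unexpected = toy_class[~toy_class.isin(allowed_values)]
print(unexpected.tolist())

# Share of each value instead of raw counts.
class_share = toy_class.value_counts(normalize=True)
print(class_share)
```

An empty `unexpected` Series would mean that every value is covered by the data dictionary.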

6. Check rows that have missing values for specific features.

There are several ways to check for missing values. Handling missing values in detail is outside the scope of this subject, so here we only show one simple way to explore and check them.

We can create a boolean mask that marks the rows where `Class` is missing.

In [None]:
class_mask = df['Class'].isnull()
print(class_mask[0:5])

The output shows that the `isnull()` method returns `False` if the value is not missing for a specific row and feature (`Class` in this case). Conversely, `isnull()` returns `True` if the value is missing for that row and feature.

You can also count the number of missing values by summing the mask, since each `True` counts as 1:

In [None]:
print(class_mask.sum())
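The same idea extends from one column to the whole DataFrame. The sketch below, on a toy DataFrame with made-up values and two deliberately missing entries, counts missing values per column and selects the rows that contain at least one missing value.

```python
import numpy as np
import pandas as pd

# Toy data with two deliberately missing entries (values are made up).
toy = pd.DataFrame({
    "Amount": [149.62, np.nan, 378.66, 123.50],
    "Class": [0, 1, np.nan, 0],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Rows where at least one column is missing.
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing)
```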

The `notnull()` method is the opposite of `isnull()`:

- `notnull()` returns `True`: the value is not missing for a specific row and feature (`Class` in this case).
- `notnull()` returns `False`: the value is missing for a specific row and feature.

In [None]:
class_mask_notNull = df['Class'].notnull()
print(class_mask_notNull[0:5])
print(sum(class_mask_notNull))
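Since `isnull()` and `notnull()` are complementary, their counts should add up to the number of rows. A quick sanity check on a made-up Series (the values and the name `toy_class` are illustrative only) might look like this:

```python
import numpy as np
import pandas as pd

# Made-up labels with two missing entries.
toy_class = pd.Series([0, 1, np.nan, 0, np.nan], name="Class")

n_missing = toy_class.isnull().sum()
n_present = toy_class.notnull().sum()
print(n_missing, n_present)

# Every row is either missing or present, never both.
assert n_missing + n_present == len(toy_class)

# Series.count() also returns the number of non-missing values.
assert toy_class.count() == n_present
```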

7. Generate a tabular report of summary statistics of specific features

The output shows the following for the `Amount` feature:
- the number of values (`count`)
- the mean (`mean`)
- the standard deviation (`std`)
- the minimum value (`min`)
- the quartiles (`25%`, `50%`, `75%`)
- the maximum value (`max`)

The output shows that there are `284807` values. The minimum amount is `0.0` and the maximum is `25691.16`.

Analysing these values can indicate how the feature behaves and whether it could help with the business problem (e.g. predicting whether a credit card transaction is fraudulent based on the amount spent).

In [None]:
df[['Amount']].describe()
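To relate the amount to the target, the same summary can be computed separately for each class. The following is a minimal sketch using `groupby` on a toy DataFrame. The amounts and labels are made up and say nothing about the real dataset.

```python
import pandas as pd

# Made-up amounts and labels for illustration only.
toy = pd.DataFrame({
    "Amount": [10.0, 20.0, 30.0, 500.0, 700.0],
    "Class": [0, 0, 0, 1, 1],
})

# One row of summary statistics per class value.
amount_by_class = toy.groupby("Class")["Amount"].describe()
print(amount_by_class)
```

Comparing the rows of this table is one simple way to see whether the amount distribution differs between the two classes.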