# Data Exploration

* **Explore data frame**

* **Explore missing values**

* **Explore outliers**

* **Explore univariate distributions**

* **Explore bivariate relationships**

* **Explore multivariate relationships**

---

### **Import Python Libraries**

In [67]:
import pandas as pd
import numpy as np

In [68]:
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Render plots inline, directly below the code cell that produces them.
%matplotlib inline

In [69]:
import seaborn as sns

### **Run helper functions**

In [70]:
%run __functions__.ipynb

### **Load data**

> For a detailed description of the dataset, see [Kaggle](https://www.kaggle.com/c/titanic/data).

In [71]:
df = pd.read_csv("data/Titanic/train.csv")

### **Explore data frame**

> - **Rows**
> - **Columns**
> - **Column types**
> - **Examples**

In [72]:
explore_dataframe(df)


Shape of the DataFrame: (891, 12) 

df contains 891 rows.
df contains 12 columns: 

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

First record:


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S



Last record:


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


##### Change column types
  
>The columns of a pandas DataFrame can hold a variety of data types.
>Most of these are based on NumPy's data types, since pandas is built on top of NumPy.
>The most common column data types are summarized below:

**1. Numeric types**  
<u>int64</u>: Integer data type for 64-bit integers.  
<u>float64</u>: Floating-point data type for 64-bit floats.  
<u>complex128</u>: Complex number data type stored in 128 bits (two 64-bit floats, one each for the real and imaginary parts).  

**2. Boolean type**  
<u>bool</u>: Boolean data type, representing True or False values.  

**3. Object type**  
<u>object</u>: A general-purpose data type for arbitrary Python objects. Columns of this type can hold values of any type, including mixed types, but in practice they most often hold text.  

**4. String type**  
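<u>string</u>: A dedicated string data type, introduced in pandas 1.0, that holds only text and missing values, which makes it more explicit and consistent than using object for strings. With the default Python-backed storage it is not necessarily faster or smaller than object.  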

**5. Datetime and Timedelta types**  
<u>datetime64[ns]</u>: Date and time data type with nanosecond precision.  
<u>timedelta64[ns]</u>: Data type for time differences (durations) with nanosecond precision.  

**6. Categorical type**  
<u>category</u>: A data type for categorical variables, which can take on a limited and usually fixed number of possible values (categories). It is usually more memory-efficient than object or string when a column has few distinct values relative to its number of rows.  
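
The memory claim for `category` can be checked directly. The following is a minimal sketch on a hypothetical toy column (three labels repeated many times, not the Titanic data); it compares the deep memory footprint of the same values stored as `object` and as `category`.

```python
import pandas as pd

# Hypothetical toy column: three distinct labels repeated 1000 times each.
labels = pd.Series(["S", "C", "Q"] * 1000)

# deep=True also counts the Python string objects that the column references.
object_bytes = labels.astype("object").memory_usage(deep=True)
category_bytes = labels.astype("category").memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

The saving comes from storing each distinct label once and keeping only small integer codes per row, so it shrinks as the number of distinct values approaches the number of rows.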

Mapping dictionary

In [73]:
column_type_mapping = {
    "PassengerId": "int64",
    "Survived": "category",
    "Pclass": "category",
    "Name": "string",
    "Sex": "category",
    "Age": "float64",
    "SibSp": "category",
    "Parch": "category", 
    "Ticket": "string",
    "Fare": "float64",
    "Cabin": "string",
    "Embarked": "category"
}

In [74]:
df = change_column_types(df, column_type_mapping)

Column types before mapping
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object 

Column types after mapping
PassengerId             int64
Survived             category
Pclass               category
Name           string[python]
Sex                  category
Age                   float64
SibSp                category
Parch                category
Ticket         string[python]
Fare                  float64
Cabin          string[python]
Embarked             category
dtype: object
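
The helper `change_column_types` is defined in `__functions__.ipynb` and is not shown here. A minimal sketch of how such a helper could work, assuming it simply wraps `DataFrame.astype` and prints the dtypes before and after, is given below. The name `change_column_types_sketch` and the two-row toy frame are illustrative assumptions, not the notebook's actual implementation.

```python
import pandas as pd


def change_column_types_sketch(frame, mapping):
    """Hypothetical stand-in: cast columns as given in `mapping`, return a new frame."""
    print("Column types before mapping")
    print(frame.dtypes, "\n")
    # astype accepts a {column name: dtype} dictionary and returns a new DataFrame.
    converted = frame.astype(mapping)
    print("Column types after mapping")
    print(converted.dtypes)
    return converted


# Toy frame with three of the Titanic columns (values for illustration only).
toy = pd.DataFrame(
    {
        "Survived": [0, 0],
        "Name": ["Braund, Mr. Owen Harris", "Dooley, Mr. Patrick"],
        "Fare": [7.25, 7.75],
    }
)
toy_mapping = {"Survived": "category", "Name": "string", "Fare": "float64"}
converted = change_column_types_sketch(toy, toy_mapping)
```

Note that `astype` raises a `KeyError` if the mapping names a column that is not in the frame, so the mapping has to match the loaded data.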


Select categorical columns

In [None]:
categorical_cols = df.select_dtypes(include="category").columns.to_list()

In [None]:
print(categorical_cols)

Select numerical columns

In [None]:
numerical_cols = df.select_dtypes(include="number").columns.to_list()

In [None]:
print(numerical_cols)

In [None]:
# Remove the ID column from the list (it is an identifier, not a measurement)
numerical_cols.remove("PassengerId")

In [None]:
print(numerical_cols)

##### Examples

Display the first/last n rows

In [None]:
display(df.head(1))
display(df.tail(1))

Display sorted by column

In [None]:
display(df.sort_values(by=["Survived", "Name"], ascending=[False, True]).head(3))

### **Explore missing values**
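
No cell is given for this step. As a minimal sketch using plain pandas (not a helper from `__functions__.ipynb`), missing values can be counted per column with `isna()`. The toy frame below is a hypothetical stand-in for `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df (values for illustration only).
toy = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 32.0, np.nan],
        "Cabin": [None, "C85", None, None],
        "Fare": [7.25, 71.28, 7.75, 8.05],
    }
)

# isna() marks missing cells; sum() counts them and mean() gives their share.
missing = pd.DataFrame(
    {
        "n_missing": toy.isna().sum(),
        "share_missing": toy.isna().mean(),
    }
).sort_values("n_missing", ascending=False)

print(missing)
```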

### **Explore outliers**

In [None]:
explore_outliers(data=df, columns=numerical_cols)
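
How `explore_outliers` flags values is not shown in this notebook. One common convention, assumed here purely for illustration, is Tukey's rule: a value is an outlier if it lies more than 1.5 times the interquartile range (IQR) outside the quartiles. The function name `iqr_outlier_mask` and the toy fares are hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN are False, so missing values are never flagged.
    return (values < lower) | (values > upper)


# Hypothetical toy fares with one extreme value.
fares = pd.Series([7.25, 7.75, 8.05, 13.0, 26.0, 512.33])
mask = iqr_outlier_mask(fares)
print(fares[mask])
```

On skewed columns such as fares, this rule can flag many legitimate high values, so the flagged rows are worth inspecting before anything is removed.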

### **Explore univariate distributions**

> - **<u>Discrete</u> distributions**
> - **<u>Continuous</u> distributions**

##### <u>Discrete</u> distributions

In [None]:
explore_univariate(
    data=df, columns=categorical_cols, dist_type="discrete", relative_frequency=True
)
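
For reference, the relative frequencies of a discrete column can also be computed directly with `value_counts(normalize=True)`. This is a minimal sketch on a hypothetical toy column; whether `explore_univariate` computes them the same way is an assumption.

```python
import pandas as pd

# Hypothetical toy column of embarkation ports.
embarked = pd.Series(["S", "S", "C", "Q", "S"], dtype="category")

# normalize=True returns shares that sum to 1 instead of raw counts.
relative_frequency = embarked.value_counts(normalize=True)
print(relative_frequency)
```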

##### <u>Continuous</u> distributions

In [None]:
explore_univariate(
    data=df,
    columns=numerical_cols,
    dist_type="continuous",
    n_bins=30,
    remove_outliers=True,
)
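
The effect of `n_bins=30` can be illustrated without plotting. The sketch below bins a hypothetical toy column into 30 equal-width intervals with `numpy.histogram`; it assumes missing values are dropped first, which may differ from what `explore_univariate` does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages, including one missing value.
ages = pd.Series([22.0, 38.0, 26.0, 35.0, np.nan, 54.0, 2.0, 27.0])

# Drop missing values, then split the observed range into 30 equal-width bins.
observed_ages = ages.dropna().to_numpy()
counts, edges = np.histogram(observed_ages, bins=30)

print(counts)
print(edges[:3])
```

With 30 bins there are 31 bin edges, running from the smallest to the largest observed value.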

### **Excursus: Sweetviz Library**

In [None]:
%pip install sweetviz

In [None]:
import sweetviz as sv

In [None]:
# report = sv.analyze(df)
# report.show_html("sweetviz_report.html")

### **Explore bivariate relationships**

In [None]:
explore_bivariate_relationships(
    data=df,
    categorical_columns=categorical_cols,
    numerical_columns=numerical_cols,
    y_column="Age",
    y_type="continuous",
)
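
A numeric counterpart to the call above, for a continuous target against one categorical column, is a grouped summary. The following minimal sketch uses plain pandas on a hypothetical toy frame and is not the implementation of `explore_bivariate_relationships`.

```python
import pandas as pd

# Hypothetical toy frame: passenger class (categorical) and age (continuous).
toy = pd.DataFrame(
    {
        "Pclass": pd.Categorical([1, 1, 3, 3]),
        "Age": [38.0, 35.0, 22.0, 32.0],
    }
)

# observed=True restricts the result to categories that actually occur.
age_by_class = toy.groupby("Pclass", observed=True)["Age"].agg(["count", "median"])
print(age_by_class)
```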

### **Explore multivariate relationships**
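
No cell is given for this step. One simple way to look at three variables at once, sketched here with plain pandas on a hypothetical toy frame, is a pivot table: mean `Age` broken down by `Sex` and `Pclass`.

```python
import pandas as pd

# Hypothetical toy frame standing in for df (values for illustration only).
toy = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male", "male"],
        "Pclass": [3, 1, 1, 3, 1],
        "Age": [22.0, 38.0, 35.0, 32.0, 54.0],
    }
)

# Rows are Sex, columns are Pclass, cells hold the mean Age of each group.
age_table = pd.pivot_table(
    toy, values="Age", index="Sex", columns="Pclass", aggfunc="mean", observed=True
)
print(age_table)
```

Combinations that do not occur in the data (here, female passengers in class 3) appear as `NaN`.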