In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

## Exploratory Data Analysis

<span style="font-size:15px;"> Exploratory Data Analysis, or EDA for short, is the process of understanding data using statistical and visualization techniques. It is an important part of data analysis.
The later steps in data analysis, such as cleaning and modeling, depend on EDA. It is worth spending time exploring the data to extract hidden patterns from it.

There are two main types of EDA:

* Univariate Analysis
* Bivariate Analysis
</span>

### **Univariate Analysis**

Univariate analysis is a type of exploratory analysis in which we analyze each feature independently. We use different visualization and statistical techniques to understand each feature in the data. <br><br>

We use visualization techniques such as histograms, density plots and box plots to understand the distribution of the data. The type of plot to use depends on the type of variable: to check the distribution of a categorical variable we use a bar plot, and for a continuous variable we use a histogram. <br><br>

Statistical quantities such as the mean, median, percentiles and skewness can also be used to understand each feature.
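
As a minimal sketch of these statistical quantities, the block below computes them with pandas on a synthetic right-skewed sample. The series `feature` and its exponential distribution are assumptions made for illustration; they are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample standing in for a real feature.
rng = np.random.default_rng(0)
feature = pd.Series(rng.exponential(scale=5.0, size=10_000), name="feature")

summary = {
    "mean": feature.mean(),
    "median": feature.median(),
    "p25": feature.quantile(0.25),   # 25th percentile
    "p75": feature.quantile(0.75),   # 75th percentile
    "skewness": feature.skew(),      # > 0 means a longer right tail
}
print(pd.Series(summary).round(3))
```

A mean well above the median together with a positive skewness is a quick numeric hint that a feature is right-skewed, which the histogram can then confirm.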

<span style="font-size:15px;"> Let's load the data first. </span>

In [None]:
#Loading Files
train_data = pd.read_csv("/kaggle/input/ventilator-pressure-prediction/train.csv")
test_data = pd.read_csv("/kaggle/input/ventilator-pressure-prediction/test.csv")

In [None]:
print("Train data shape ",train_data.shape)
print("Test data shape  ",test_data.shape)

In [None]:
train_data.info()

In [None]:
train_data.iloc[:,2:].describe()

In [None]:
train_data.isnull().sum()

**Inferences:**
* All the features are of int or float type. Although features R and C are of int type, they take only a few distinct values, so we can treat them as categorical features.
* The summary statistics suggest that some features have outliers. For example, the mean of u_in is 7.3 while its maximum is 100, so it is right-skewed and has outliers.
* There are no null values in the data.
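
The two checks behind these inferences can be sketched as small helpers. The function names `low_cardinality_columns` and `iqr_outlier_fraction`, the `max_unique=10` threshold, the 1.5 x IQR rule, and the toy frame are all assumptions for illustration; in the notebook the same calls would be made on `train_data`.

```python
import numpy as np
import pandas as pd

def low_cardinality_columns(df, max_unique=10):
    """Return numeric columns with at most `max_unique` distinct values."""
    numeric = df.select_dtypes(include="number")
    return [col for col in numeric.columns if numeric[col].nunique() <= max_unique]

def iqr_outlier_fraction(series, k=1.5):
    """Fraction of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return outside.mean()

# Hypothetical toy frame, not the competition data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "R": rng.choice([5, 20, 50], size=1_000),
    "C": rng.choice([10, 20, 50], size=1_000),
    "u_in": rng.exponential(scale=7.0, size=1_000),
})

print(low_cardinality_columns(toy))
print(round(iqr_outlier_fraction(toy["u_in"]), 3))
```

The IQR rule is only one possible definition of an outlier; it matches what a box plot draws beyond its whiskers.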

In [None]:
fig,ar = plt.subplots(nrows=1,ncols=2,figsize=(8,4))
fig.suptitle("Distribution of variable R",fontsize=15,)
test_data["R"].value_counts().plot.bar(ylabel="Count",xlabel="R",title="Test Data",ax=ar[0])
train_data["R"].value_counts().plot.bar(ylabel="Count",xlabel="R",title="Train Data",ax=ar[1])

In [None]:
fig,ar = plt.subplots(nrows=1,ncols=2,figsize=(8,4))
fig.suptitle("Distribution of variable C",fontsize=15,)
test_data["C"].value_counts().plot.bar(ylabel="Count",xlabel="C",title="Test Data",ax=ar[0])
train_data["C"].value_counts().plot.bar(ylabel="Count",xlabel="C",title="Train Data",ax=ar[1])

In [None]:
fig,ar = plt.subplots(nrows=1,ncols=2,figsize=(8,4))
fig.suptitle("Distribution of variable u_out",fontsize=15,)
test_data["u_out"].value_counts().plot.bar(ylabel="Count",xlabel="u_out",title="Test Data",ax=ar[0])
train_data["u_out"].value_counts().plot.bar(ylabel="Count",xlabel="u_out",title="Train Data",ax=ar[1])

In [None]:
fig,ar = plt.subplots(nrows=1,ncols=2,figsize=(12,6))
fig.suptitle("Distribution of variable time_step",fontsize=15,)
sns.distplot(train_data["time_step"],bins=100,ax = ar[0])
sns.boxplot(y=train_data["time_step"],ax=ar[1])

In [None]:
# Log-transform time_step; entries equal to 0 are left at 0 to avoid log(0)
m = train_data["time_step"].to_numpy(dtype=float)
m = np.log(m, out=np.zeros_like(m), where=(m != 0))

fig,ar = plt.subplots(nrows=1,ncols=2,figsize=(12,6))
fig.suptitle("Distribution of log(time_step)",fontsize=15)
sns.distplot(m,bins=100,ax = ar[0])
sns.boxplot(y=m,ax=ar[1])

In [None]:
fig,ar = plt.subplots(nrows=1,ncols=2,figsize=(12,6))
fig.suptitle("Distribution of variable u_in",fontsize=15,)
sns.distplot(train_data["u_in"],bins=200,ax = ar[0])
#ar[0].xaxis.zoom(-3)
ar[0].set_xlim([-5, 20])
sns.boxplot(y=train_data["u_in"],ax=ar[1])

In [None]:
fig,ar = plt.subplots(nrows=1,ncols=2,figsize=(12,6))
fig.suptitle("Distribution of variable Pressure",fontsize=15,)
sns.distplot(train_data["pressure"],bins=200,ax = ar[0])
sns.boxplot(y=train_data["pressure"],ax=ar[1])

**Inferences:**

* Train and test data have similar distributions.
* Variables "R" and "C" are categorical. We can use one-hot or label encoding to preprocess these columns before building a model.
* "time_step" is roughly uniformly distributed, and the box plot shows no outliers. Note that a log transformation does not make it normal: the log of a uniformly distributed variable is left-skewed, so the transformed plot should be checked before relying on it.
* "u_in" has many outliers, which should be handled during preprocessing. Its values are also concentrated around 0-1 and 4-5.
* The target variable "pressure" appears approximately normally distributed. Most of the values lie between 1 and 10.
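
The two encodings mentioned for "R" and "C" can be sketched with pandas alone. The toy rows below are hypothetical, and the choice of `pd.get_dummies` and category codes is an assumption; other encoders would work equally well.

```python
import pandas as pd

# Hypothetical toy rows, not the competition data.
toy = pd.DataFrame({
    "R": [5, 20, 50, 20],
    "C": [10, 50, 20, 10],
    "u_in": [0.0, 4.2, 7.9, 1.1],
})

# One-hot encoding: one indicator column per category value.
one_hot = pd.get_dummies(toy, columns=["R", "C"], prefix=["R", "C"])

# Label encoding: map each category to an integer code (sorted order).
label = toy.copy()
for col in ["R", "C"]:
    label[col] = label[col].astype("category").cat.codes

print(one_hot.columns.tolist())
print(label)
```

One-hot encoding avoids implying an order between categories, while label encoding keeps the column count unchanged, which tree-based models usually handle well.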

### **Bivariate Analysis**

Bivariate analysis looks for the relation or association between two features. We can use visualization techniques such as scatter plots, pair plots and bar plots; the choice of plot depends on the types of the two variables. We can also use statistical techniques such as hypothesis testing and correlation to measure the association between two features.

Bivariate analysis can be performed for any combination of continuous and categorical variables: Categorical & Categorical, Categorical & Continuous, and Continuous & Continuous.

Continuous & Continuous: We can use scatter plots and correlation to test the association.<br>
Categorical & Categorical: We can use frequency tables, stacked column charts and chi-square tests. <br>
Categorical & Continuous: We can use box plots, z-tests/t-tests and ANOVA. <br>
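
A minimal sketch of one statistical test per combination, using `scipy.stats`. SciPy is not imported elsewhere in this notebook, and the synthetic frame `df` with its columns `group`, `flag`, `x` and `y` is an assumption made purely for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical synthetic data, used only to illustrate the three cases.
rng = np.random.default_rng(0)
n = 2_000
df = pd.DataFrame({
    "group": rng.choice(["a", "b", "c"], size=n),
    "flag": rng.choice([0, 1], size=n),
    "x": rng.normal(size=n),
})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.5, size=n)  # y depends on x

# Continuous & continuous: Pearson correlation coefficient and p-value.
r, r_pvalue = stats.pearsonr(df["x"], df["y"])

# Categorical & categorical: chi-square test on a frequency table.
table = pd.crosstab(df["group"], df["flag"])
chi2, chi2_pvalue, dof, expected = stats.chi2_contingency(table)

# Categorical & continuous: one-way ANOVA of y across the groups.
samples = [g["y"].to_numpy() for _, g in df.groupby("group")]
f_stat, anova_pvalue = stats.f_oneway(*samples)

print(f"Pearson r = {r:.3f}")
print(f"Chi-square p-value = {chi2_pvalue:.3f} (dof = {dof})")
print(f"ANOVA p-value = {anova_pvalue:.3f}")
```

With millions of rows, p-values become tiny even for negligible effects, so the effect size (for example the correlation coefficient itself) is usually the more informative number.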

<span style="font-size:15px;"> Because the dataset is large, drawing scatter plots from the entire data is very time consuming and the plots become cluttered. So let's sample 20% of the data and use that for our EDA. </span>

In [None]:
train, val = train_test_split(train_data, test_size=0.2, random_state=42 , shuffle=True)

Let's look at the relation between the target variable and u_in. The "R" feature is used to color the points, so we can see how the relationship between the target and u_in changes for each class of R.

In [None]:
sns.scatterplot(data=val,x="u_in",y="pressure",hue="R")

As we can see, the scatter plot is very cluttered, and some points hide others. Let's draw a separate plot for each class of variable "R".

In [None]:
# One panel per value of R; boolean masks replace val.iloc[np.where(...)]
fig,ax = plt.subplots(nrows=2,ncols = 2,figsize=(20,30))
for axis, r in zip(ax.flat, [50, 20, 5]):
    sns.scatterplot(data=val[val["R"]==r],x="u_in",y="pressure",ax=axis)
    axis.set_title(f"R = {r}")
ax[1,1].set_visible(False)  # hide the unused fourth panel

Similarly, let's draw scatter plots of pressure against u_in for the different classes of variable "C".

In [None]:
# One panel per value of C; boolean masks replace val.iloc[np.where(...)]
fig,ax = plt.subplots(nrows=2,ncols = 2,figsize=(20,30))
for axis, c in zip(ax.flat, [10, 20, 50]):
    sns.scatterplot(data=val[val["C"]==c],x="u_in",y="pressure",ax=axis)
    axis.set_title(f"C = {c}")
ax[1,1].set_visible(False)  # hide the unused fourth panel

In [None]:
sns.boxplot(data=val, x="R", y = "pressure")

In [None]:
sns.boxplot(data=val, x="C", y = "pressure")

In [None]:
sns.boxplot(data=val, x="u_out", y = "pressure")

**Inferences:**
* There is not much correlation between u_in and pressure overall. From the plots for the different classes of R and C, we can see a slight negative correlation between u_in and pressure for R=50 and C=10.
* From the box plots, the distribution of pressure looks similar across the different categories of R and C.
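
The visual impression from the scatter plots can be backed by a number per group. The sketch below computes the Pearson correlation between u_in and pressure within each class of R on a synthetic frame; the data, including the built-in negative slope for R = 50, is an assumption for illustration only. In the notebook the same dictionary comprehension would be run on `val`.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic sample, not the competition data.
rng = np.random.default_rng(0)
n = 3_000
toy = pd.DataFrame({
    "R": rng.choice([5, 20, 50], size=n),
    "u_in": rng.uniform(0, 100, size=n),
})
# Make pressure fall slightly with u_in only when R == 50 (illustration only).
slope = np.where(toy["R"] == 50, -0.05, 0.0)
toy["pressure"] = 10 + slope * toy["u_in"] + rng.normal(scale=2.0, size=n)

# Pearson correlation between u_in and pressure within each R group.
per_group_corr = {
    r: group["u_in"].corr(group["pressure"])
    for r, group in toy.groupby("R")
}
print(pd.Series(per_group_corr, name="corr(u_in, pressure)").round(3))
```

Grouping by "C" instead of "R" works the same way.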