# EDA
Let's get started!
### 1. Importing the required libraries for EDA

In [None]:
import pandas as pd 
import numpy as np                     # For mathematical calculations 
import seaborn as sns                  # For data visualization 
import matplotlib.pyplot as plt        # For plotting graphs 
%matplotlib inline 
sns.set(color_codes=True)
import warnings                        # To ignore any warnings
warnings.filterwarnings("ignore")

## 2. Loading the data into the data frame.

In [None]:
df = pd.read_csv("example.",delimiter=' ')
# To display the top 5 rows
df.head(5)

###  Firstly, we will check the features present in our data and then we will look at their data types.

In [None]:
df.columns

#### Given below is the description for each variable.

| Variable | Description | 
|---|---|
|||

### 3. Checking the types of data

Here we check the data types because a numeric **variable** is sometimes stored as a string (object). In that case, we have to convert the string to a numeric type; only then can we plot the data in a graph.

In [None]:
# Checking the data type
df.dtypes

In [None]:
df.info()

### We can see there are three formats of data types:
1. object: descrip
2. int64: descrip
3. float64: descrip
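
When a numeric column turns out to be stored as `object`, one way to convert it is `pd.to_numeric`. The sketch below uses a small made-up frame; the names `sample`, `price` and `city` are illustrative assumptions, not columns from the data set.

```python
import pandas as pd

# Hypothetical example data: "price" was read as strings and one entry is not a number.
sample = pd.DataFrame({"price": ["10", "25", "n/a"], "city": ["A", "B", "C"]})

# errors="coerce" turns values that cannot be parsed into NaN instead of raising.
sample["price"] = pd.to_numeric(sample["price"], errors="coerce")

print(sample.dtypes)
```

Coercing to `NaN` means the unparseable entries show up later in the missing-value check rather than stopping the notebook.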

In [None]:
df.shape

### 4. Dropping irrelevant columns

This step is needed in most EDA work because a data set often contains columns that we never use; dropping them keeps the analysis focused.

In [None]:
# Dropping irrelevant columns
df = df.drop(columns=[  # column_names
            ])

df.head(5)

### 5. Renaming the columns

In this instance, most of the column names are confusing to read, so I renamed them. This is a good approach because it improves the readability of the data set.

In [None]:
# Renaming the column names
df = df.rename(columns={}
              )
df.head(5)

### 6. Dropping the duplicate rows

This is often a handy thing to do because a huge data set often has some duplicate rows, which can distort the analysis, so here I remove all the duplicate rows from the data set.

In [None]:
# Total number of rows and columns
print(df.shape)

# Rows containing duplicate data
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape)

In [None]:
# Dropping the duplicates 
df = df.drop_duplicates()
df.head(5)
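
To make the behaviour of `duplicated()` and `drop_duplicates()` concrete, here is a minimal sketch on a made-up frame (`toy` and its columns are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical data: the second row repeats the first.
toy = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# duplicated() flags every row that is identical to an earlier row.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, deduped.shape)
```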

In [None]:
df.describe()

***Here, note the mean and median (the 50% row, i.e., the 50th percentile) of each column. \
      Note the difference between the 75th percentile and the max value of each predictor. \
      A large gap between them suggests extreme values (outliers) in our data set.***

## Next Section

## 7. Univariate Analysis

**In this section, we will do univariate analysis. It is the simplest form of analyzing data where we examine each variable individually. For categorical features we can use frequency table or bar plots which will calculate the number of each category in a particular variable. For numerical features, probability density plots can be used to look at the distribution of the variable.**

### Target Variable (if Categorical)

In [None]:
df.Target.unique()

In [None]:
# We will first look at the target variable, i.e., Target.
# As it is a categorical variable, let us look at its frequency table, percentage distribution and bar plot.

df['Target'].value_counts()

In [None]:
# normalize can be set to True to print proportions instead of counts
df['Target'].value_counts(normalize=True)

In [None]:
df['Target'].value_counts().plot.bar()

### Target Variable (if Numerical)

In [None]:
plt.figure(figsize=(16, 5))
plt.subplot(121)
sns.histplot(df['Target'], kde=True)   # distplot is deprecated in recent seaborn releases
plt.subplot(122)
df['Target'].plot.box()
plt.show()

### Independent Variable (if Categorical)

In [None]:
plt.figure(1) 

plt.subplot(221)
df['Variable'].value_counts(normalize=True).plot.bar(figsize=(20,10), title= 'Variable name') 
plt.subplot(222)
df['Variable'].value_counts(normalize=True).plot.bar(title= 'Variable name') 

plt.show()

### It can be inferred from the above bar plots that:

1. 
2. 

### Independent Variable (if Ordinal)

In [None]:
plt.figure(1) 

plt.subplot(221)
df['Variable'].value_counts(normalize=True).plot.bar(figsize=(20,10), title= 'Variable name') 
plt.subplot(222)
df['Variable'].value_counts(normalize=True).plot.bar(title= 'Variable name') 

plt.show()

### It can be inferred from the above bar plots that:

1. 
2. 

### Independent Variable (if Numerical)

In [None]:
plt.figure(figsize=(16, 5))
plt.subplot(121)
sns.histplot(df['Variable'], kde=True)   # distplot is deprecated in recent seaborn releases
plt.subplot(122)
df['Variable'].plot.box()
plt.show()

### Following inferences can be made from the above distribution plots:

1. Most of the data in the distribution is skewed towards the right / left, which means it is not normally distributed. We will try to make it closer to normal in later sections, as some algorithms work better when the data is approximately normally distributed.

### After looking at every variable individually in univariate analysis, we will now explore them again with respect to the target variable (if Categorical).

## 8. Bivariate Analysis (if target variable is Categorical)

### 8.1 Categorical Independent Variable vs Target Variable
First of all we will find the relation between target variable and categorical independent variables. Let us look at the stacked bar plot now which will give us the proportion.

In [None]:
cat_vs_target = pd.crosstab(df['Categorical'], df['Target'])
cat_vs_target

In [None]:
# Divide each row by its total so the stacked bars show proportions
cat_vs_target.div(cat_vs_target.sum(axis=1).astype(float), axis=0).plot(kind="bar", stacked=True, figsize=(4,4))

### Following inferences can be made from the above proportion plots:

1.

### 8.2 Numerical Independent Variable vs Target Variable


In [None]:
bins = []   # bin edges (ascending)

group = []  # group names, one fewer than the bin edges

df['extra_col'] = pd.cut(df['variable'], bins, labels=group)

# Cross-tabulate the binned variable against the target, not against itself
binned_ct = pd.crosstab(df['extra_col'], df['Target'])

binned_ct.div(binned_ct.sum(axis=1).astype(float), axis=0).plot(kind="bar", stacked=True)

plt.xlabel('')
P = plt.ylabel('')

### Following inferences can be made from the above bar plots:

1. 
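
As a concrete illustration of the binning step in 8.2, the sketch below bins a made-up numeric column and computes the per-bin proportions of the target. The frame `toy`, the column `income`, the bin edges and the group names are all assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical data: a numeric variable and a categorical target.
toy = pd.DataFrame({
    "income": [1200, 2600, 4100, 7000, 3000, 5200],
    "Target": ["N", "Y", "Y", "Y", "N", "Y"],
})

bins = [0, 2500, 4000, 6000, 81000]               # bin edges (right-inclusive by default)
group = ["Low", "Average", "High", "Very high"]   # one label per bin

toy["income_bin"] = pd.cut(toy["income"], bins, labels=group)

# Counts of each target class per bin, then row-wise proportions.
income_ct = pd.crosstab(toy["income_bin"], toy["Target"])
proportions = income_ct.div(income_ct.sum(axis=1), axis=0)

print(proportions)
```

Calling `proportions.plot(kind="bar", stacked=True)` on the result gives the stacked proportion plot used in this section.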

## 9. Bivariate Analysis (if target variable is Numerical)


### Now let's look at the correlation between all the numerical variables.
### We will use a heat map to visualize the correlation. Heatmaps visualize data through variations in coloring.
### Darker colors indicate a higher correlation.


In [None]:
# Finding the relations between the variables.
plt.figure(figsize=(20,10))
matrix = df.corr(numeric_only=True)   # correlate numeric columns only
sns.heatmap(matrix, vmax=.8, square=True, cmap="BuPu", annot=True);

### To check distribution-Skewness

In [None]:
sns.histplot(df['Variable'], kde=True)
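
Alongside the plot, skewness can also be checked numerically with `Series.skew()`. This is a minimal sketch on made-up values (the series `toy` is an assumption): a positive result indicates a long right tail, a negative one a long left tail, and a value near zero a roughly symmetric distribution.

```python
import pandas as pd

# Hypothetical data with one very large value, which produces a long right tail.
toy = pd.Series([1, 2, 2, 3, 3, 3, 4, 50], name="Variable")

skewness = toy.skew()   # sample skewness; positive means right-skewed
print(f"skewness = {skewness:.2f}")
```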

### 10. Dropping irrelevant columns which do not have any effect on the target variable

After exploring all the variables in our data, let's drop the variables that do not have any effect on the target variable. We will make the same changes to the test dataset that we made to the training dataset.

In [None]:
df = df.drop(['col_name'], axis=1)


# Next Section

### 11. Dropping or Replacing the missing or null values.

This is similar to the previous step, except that here the missing values are detected first and then dropped or replaced.\
**Dropping: this is generally not a good approach, but it is acceptable when the number of missing values is small enough to be negligible.\
Replacing: replace the missing values with the mean or the median of that column.**

After exploring all the variables in our data, we can now impute the missing values and treat the outliers because missing data and outliers can have adverse effect on the model performance.

In [None]:
# Finding the null values.
print(df.isnull().sum())

We can consider these methods to fill the missing values:

***1. For numerical variables: imputation using mean or median***\
***2. For categorical variables: imputation using mode***

In [None]:
# Replacing the missing values of a categorical variable with the mode.

df['categorical variables'] = df['categorical variables'].fillna(df['categorical variables'].mode()[0])

In [None]:
# Replacing the missing values of a numerical variable with the median.
df['numerical variables'] = df['numerical variables'].fillna(df['numerical variables'].median())

In [None]:
df.info()
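
The two imputation rules can be tried on a small made-up frame first. In this sketch the frame `toy` and the columns `gender` and `income` are assumptions for illustration; the categorical gap is filled with the mode and the numerical gap with the median.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
toy = pd.DataFrame({
    "gender": ["M", "F", None, "M"],
    "income": [10.0, np.nan, 30.0, 50.0],
})

# Categorical: fill with the most frequent value (mode).
toy["gender"] = toy["gender"].fillna(toy["gender"].mode()[0])

# Numerical: fill with the median of the observed values.
toy["income"] = toy["income"].fillna(toy["income"].median())

print(toy)
```

Assigning the result back to the column, rather than calling `fillna(..., inplace=True)` on a column selection, avoids chained-assignment warnings in recent pandas versions.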

### 12. Outlier Treatment
An outlier is a point or set of points that differs markedly from the other points; its value can be very high or very low. It’s often a good idea to detect and remove outliers because they are a common cause of a less accurate model. The detection and removal technique used here is the IQR score technique. Outliers can often be seen in visualizations such as a box plot.

As we saw earlier in univariate analysis, a variable can contain outliers, so we have to treat them because the presence of outliers affects the distribution of the data.

In [None]:
# Quartiles and interquartile range of the numeric columns
Q1 = df.quantile(0.25, numeric_only=True)
Q3 = df.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1
print(IQR)

In [None]:
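# Keep rows whose values all lie within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
num = df[IQR.index]   # only the columns for which an IQR was computed
df = df[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)]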
df.shape
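
To see the IQR rule at work, here is a minimal sketch on made-up data (the frame `toy` and its columns `x` and `y` are assumptions). A row is dropped if any of its values lies more than 1.5 × IQR below the first quartile or above the third quartile of its column.

```python
import pandas as pd

# Hypothetical data: the value 100 in column "x" is far from the rest.
toy = pd.DataFrame({
    "x": [10, 12, 11, 13, 12, 11, 100],
    "y": [1, 2, 3, 2, 1, 2, 3],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # per-column lower fence
upper = q3 + 1.5 * iqr   # per-column upper fence

# A row is an outlier if any of its values falls outside the fences.
outlier_mask = ((toy < lower) | (toy > upper)).any(axis=1)
cleaned = toy[~outlier_mask]

print(cleaned.shape)
```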

In [None]:
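# Log transform to reduce right skew; assumes 'var' is strictly positive
# (np.log1p is an option when zeros are present)
df['var_log'] = np.log(df['var'])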
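
The effect of a log transform on skewness can be checked directly. This sketch uses made-up values (the series `toy` is an assumption) and `np.log1p`, which computes log(1 + x) and therefore stays finite when the data contains zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data that includes a zero.
toy = pd.Series([0, 1, 2, 3, 5, 8, 200], dtype=float)

logged = np.log1p(toy)   # log(1 + x), defined at x = 0

# Compare skewness before and after the transform.
print(toy.skew(), logged.skew())
```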

### 13. Get_Dummies

In [None]:
df=pd.get_dummies(df) 
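
`pd.get_dummies` replaces each categorical column with one indicator column per category and leaves numeric columns untouched. A minimal sketch on a made-up frame (`toy`, `color` and `size` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: one categorical column and one numeric column.
toy = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})

# "color" becomes color_blue / color_red; "size" is kept as it is.
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
```

Passing `drop_first=True` drops one indicator per categorical column, which can help when a model is sensitive to perfectly collinear features.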

# Next Section

### 14. Feature Engineering

Based on domain knowledge, we can come up with new features that might affect the target variable. We will create new features in the following way:

By adding/subtracting two columns
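
A feature built by adding or subtracting two columns could look like the sketch below. The frame `toy` and the column names are hypothetical; which columns are worth combining depends on the domain.

```python
import pandas as pd

# Hypothetical data: two related numeric columns.
toy = pd.DataFrame({
    "applicant_income": [3000, 4500],
    "coapplicant_income": [1500, 0],
})

# New feature from the sum of two columns.
toy["total_income"] = toy["applicant_income"] + toy["coapplicant_income"]

# New feature from the difference of two columns.
toy["income_gap"] = toy["applicant_income"] - toy["coapplicant_income"]

print(toy)
```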
    

# Last Section

Save the final data frame to a CSV file.

In [None]:
df.to_csv('data_final.csv',index=False)