# Exploratory Data Analysis (EDA)
- Performed to gain a better understanding of the data. The goal is not to build a model but to understand the data before modeling
- Identifies the variables and the relationships between them

## Data, Data Types & variables
### Data Types
- Structured -> tabular, rows and columns
- Unstructured -> text, image, audio and video
### Variable Types
- Qualitative (non numerical)
    - nominal (e.g. name, place)
    - ordinal (e.g. categories that can be ordered, such as satisfied / dissatisfied)
- Quantitative (numerical)
    - continuous values
    - discrete values

## Measures of central tendency
- Mean 
    - Formula: (sum of items)/(total number of items)
    - Very sensitive to outliers

- Median
    - Formula: (total number of items + 1)/2 th item of the sorted list (with an even number of items, the average of the two middle items)
    - Resistant to outliers
- Mode
    - The most frequently occurring item
    - Highly resistant to outliers
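
The effect of an outlier on the three measures above can be checked with a small sketch. It uses Python's standard `statistics` module; the list `prices` and the outlier value 100 are made-up illustration data, not from any real dataset.

```python
import statistics

# Hypothetical sample (illustration only)
prices = [2, 4, 4, 5, 7, 9, 11]
with_outlier = prices + [100]   # same data plus one extreme value

mean_before = statistics.mean(prices)            # 42 / 7 = 6
mean_after = statistics.mean(with_outlier)       # 142 / 8 = 17.75

median_before = statistics.median(prices)        # 4th of 7 sorted items = 5
median_after = statistics.median(with_outlier)   # average of 5 and 7 = 6.0

mode_before = statistics.mode(prices)            # 4 occurs twice
mode_after = statistics.mode(with_outlier)       # still 4

print(mean_before, mean_after)
print(median_before, median_after)
print(mode_before, mode_after)
```

One extreme value nearly triples the mean, moves the median by one unit, and leaves the mode unchanged.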

## Measures of dispersion
- Variance
    - Formula: SUM(\[xi - mean\]^2)/total number of items
    - Note: data.var() and data.std() in pandas divide by (n - 1) by default (sample statistics)
- Standard Deviation
    - Formula: SD = SQRT(Variance)
- Range
    - Formula: Max Value - Min Value
- Quartiles
    - Q1 
        - separates the smallest 25% of values from the rest
        - Formula: (total number of items + 1)/4 th item of the sorted list
    - Q2 
        - separates the smallest 50% of values from the rest
        - Formula: (total number of items + 1)/2 th item of the sorted list
        - AKA Median
    - Q3
        - separates the smallest 75% of values from the largest 25%
        - Formula: \[(total number of items + 1)*3\]/4 th item of the sorted list
    - IQR (Interquartile range)
        - Formula: Q3 - Q1
        - represents the box in a box plot
    - Outliers
        - values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR
- Coefficient of variation
    - Greater the value = greater variation relative to the mean
    - Formula: CV = SD / Mean
    - Example: comparing the variability of housing prices between Carrollton and Richland Hills
- Z Score
    - How many standard deviations an observation is from the mean
    - Formula: Z = (xi - mean)/SD
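
The dispersion formulas above can be worked through on a small sample. This is a minimal sketch using the standard `statistics` module; the sample `values` and the extra value 30 are made up. It assumes the population (divide by n) versions of variance and SD, as in the formulas above, and uses `method="exclusive"`, which follows the (n + 1) position rule for quartiles.

```python
import statistics

values = [2, 4, 4, 5, 7, 9, 11]   # hypothetical, already sorted

mean = statistics.mean(values)             # 6
variance = statistics.pvariance(values)    # 60 / 7, divides by n
sd = statistics.pstdev(values)             # square root of the variance
value_range = max(values) - min(values)    # 11 - 2 = 9

# Quartiles with the (n + 1) rule: 2nd, 4th and 6th of the 7 sorted items
q1, q2, q3 = statistics.quantiles(values, n=4, method="exclusive")
iqr = q3 - q1                              # 9.0 - 4.0 = 5.0

# Outlier fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
lower_fence = q1 - 1.5 * iqr               # -3.5
upper_fence = q3 + 1.5 * iqr               # 16.5
outliers = [v for v in values + [30] if v < lower_fence or v > upper_fence]

cv = sd / mean                             # coefficient of variation
z_scores = [(v - mean) / sd for v in values]

print(variance, sd, value_range)
print(q1, q2, q3, iqr, outliers)
print(cv, z_scores)
```

Other tools use different quartile conventions (pandas interpolates linearly by default), so quartiles computed elsewhere may differ slightly on small samples.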

## Summarizing measured data

### Five Point Summary
- Min
- Q1
- Q2
- Q3
- Max
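
In pandas, the five values above can be read from `describe()`. The sketch below assumes pandas is installed and uses a made-up Series. Note that pandas interpolates quantiles linearly by default, so Q3 here is 8.0 rather than the 9 that the (n + 1) position rule would give for the same data.

```python
import pandas as pd

s = pd.Series([2, 4, 4, 5, 7, 9, 11])   # hypothetical sample

# describe() also reports count, mean and std; keep only the five-point summary
summary = s.describe().loc[["min", "25%", "50%", "75%", "max"]]
print(summary)
```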

### Shape of data

- Symmetrical - normal distribution (bell curve)
- Skewness 
    - Positive / Right skewed: Median < Mean
    - Negative / Left skewed: Median > Mean
    - Formula: SUM\[(xi - mean)^3\]/\[(n-1) * SD^3\]
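
The skewness formula above can be written out directly. This is a minimal sketch with made-up samples; it assumes SD means the sample standard deviation (divide by n - 1), which the notes do not state explicitly. The helper name `skewness` is chosen here for illustration.

```python
import statistics

def skewness(values):
    """SUM[(xi - mean)^3] / [(n - 1) * SD^3], using the sample SD (an assumption)."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return sum((x - mean) ** 3 for x in values) / ((n - 1) * sd ** 3)

right_skewed = [1, 2, 2, 3, 3, 4, 15]      # long tail on the right
left_skewed = [-x for x in right_skewed]   # mirror image, long tail on the left
symmetric = [1, 2, 3, 4, 5]

for sample in (right_skewed, left_skewed, symmetric):
    print(skewness(sample), statistics.mean(sample), statistics.median(sample))
```

The right-skewed sample gives a positive value with mean above median, the mirrored sample gives the opposite, and the symmetric sample gives zero.
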
### Covariance
- Measures how two variables vary together
- Formula: cov(x,y) = SUM\[(xi - mean of x) * (yi - mean of y)\]/total number of items
- Positive: x and y move in the same direction
- Negative: x and y move in opposite directions
- Doesn't imply that x and y influence each other

### Correlation
- Relationship between two variables, independent of unit (or scale)
- Formula: corr = cov(x,y)/(SDx * SDy)
- Values always in the range -1 to 1
- Value closer to -1 or 1 -> stronger correlation
- Value closer to 0 -> weaker correlation
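
A small sketch can show why correlation is described as independent of unit while covariance is not. The data (heights and weights) are made up, and the helper names `covariance` and `correlation` are chosen here for illustration. Both helpers use the divide-by-n formulas from the notes.

```python
import statistics

def covariance(xs, ys):
    """SUM[(xi - mean of x) * (yi - mean of y)] / n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """cov(x, y) / (SDx * SDy), with population SDs to match covariance()."""
    return covariance(xs, ys) / (statistics.pstdev(xs) * statistics.pstdev(ys))

# Hypothetical measurements
height_m = [1.5, 1.6, 1.7, 1.8, 1.9]
weight_kg = [50, 58, 65, 72, 80]
height_cm = [h * 100 for h in height_m]   # same heights in a different unit

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)   # about 100 times cov_m
corr_m = correlation(height_m, weight_kg)
corr_cm = correlation(height_cm, weight_kg) # unchanged by the unit switch

print(cov_m, cov_cm)
print(corr_m, corr_cm)
```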


    
        





### Generic
    - data.shape
    - data.info()
    - data.describe()
    - len(data)    
    - data.dtypes.value_counts() -- count data types
### Measures
    - data.mean() | data\['col1'\].mean()
    - data.median() | data\['col1'\].median()
    - data\['col1'\].mode()
    - data.quantile(0.25) | data.quantile(0.50) | data.quantile(0.75) 
    - IQR = data.quantile(0.75) - data.quantile(0.25)
    - Range = data.max() - data.min()
    - Variance = data.var()
    - SD = data.std()
    - Co variance = data.cov()
    - Correlation = data.corr()
    - data.skew()
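
The calls listed above can be tried on a small made-up DataFrame. This sketch assumes pandas is installed; the column names and values are illustrative. One caveat worth knowing: in pandas 2.x, frame-wide calls such as `data.mean()` or `data.corr()` raise an error when the frame contains text columns, so the sketch selects the numeric columns first.

```python
import pandas as pd

# Hypothetical data: two numeric columns and one text column
data = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0, 4.0, 10.0],
    "col2": [2.0, 4.0, 6.0, 8.0, 20.0],
    "city": ["A", "B", "A", "C", "B"],
})

n_rows, n_cols = data.shape
dtype_counts = data.dtypes.value_counts()   # how many columns per dtype

col1_mean = data["col1"].mean()             # 4.0
col1_median = data["col1"].median()         # 3.0
city_mode = data["city"].mode()             # a Series: there can be several modes

numeric = data.select_dtypes("number")      # drop text columns before frame-wide stats
iqr = numeric.quantile(0.75) - numeric.quantile(0.25)
value_range = numeric.max() - numeric.min()
corr = numeric.corr()                       # col2 is exactly 2 * col1 here

print(dtype_counts)
print(iqr, value_range, corr, sep="\n")
```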

### Cleaning data
    - data.drop('col1', axis = 1, inplace = True)
    - data.dropna()
    - data.isnull().sum()
### Handling Non numerical data
    - One Hot Encoding
        - df_dummies= pd.get_dummies(data1, prefix='Park', columns=['ParkingArea']) #This function does One-Hot-Encoding on categorical text
    - Sklearn OneHotEncoder
        - from sklearn.preprocessing import OneHotEncoder
        - hotencoder = OneHotEncoder()
        - encoded = hotencoder.fit_transform(df_dummies.RegionId.values.reshape(-1,1)).toarray() 
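
To see what `pd.get_dummies` produces, here is a minimal sketch on a made-up frame. The `ParkingArea` column and `Park` prefix follow the snippet above; the `Price` column and all values are hypothetical.

```python
import pandas as pd

# Hypothetical listing data
data1 = pd.DataFrame({
    "Price": [100, 150, 120],
    "ParkingArea": ["Covered", "Open", "Covered"],
})

# Each distinct category becomes its own indicator column named <prefix>_<category>
df_dummies = pd.get_dummies(data1, prefix="Park", columns=["ParkingArea"])
print(df_dummies)
```

The original `ParkingArea` column is dropped and replaced by `Park_Covered` and `Park_Open`. Depending on the pandas version, the indicator columns hold booleans or 0/1 integers.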
     
### Normalization & Scaling
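    - StandardScaler (normalizes using z score)
        - from sklearn.preprocessing import StandardScaler
        - std_scale = StandardScaler()
        - scaled = std_scale.fit_transform(data[['col1']])   # fit on the column(s), then scale them
    - MinMaxScaler (normalizes using (x - min)/(max - min))
        - from sklearn.preprocessing import MinMaxScaler
        - minmax_scale = MinMaxScaler()
        - scaled = minmax_scale.fit_transform(data[['col1']])   # values end up between 0 and 1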
    - Log Transformation
        - Used to compress large variances into smaller ones (zoom out)
            - import numpy as np
            - from sklearn.preprocessing import FunctionTransformer   
            - log_transformer = FunctionTransformer(np.log1p)   # log1p(x) = log(1 + x), safe for x = 0
            - logged = log_transformer.fit_transform(data[['col1']])
    - Exponential Transformation
        - Used to spread out densely packed observations (zoom in)
            - exp_transformer = FunctionTransformer(np.exp) # Exponential transform 
            - expanded = exp_transformer.fit_transform(data[['col1']])
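
What the scalers and transformers above compute can be written out with plain NumPy. This is a sketch of the underlying arithmetic on a made-up array, not a replacement for the scikit-learn classes. It assumes the population standard deviation (divide by n) for the z score, which is what `StandardScaler` uses.

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])   # hypothetical, widely spread values

# Z-score scaling: mean 0, standard deviation 1
z_scaled = (x - x.mean()) / x.std()

# Min-max scaling: smallest value maps to 0, largest to 1
minmax_scaled = (x - x.min()) / (x.max() - x.min())

# Log transform pulls large values closer together
logged = np.log1p(x)

# expm1 is the inverse of log1p and recovers the original values
restored = np.expm1(logged)

print(z_scaled, minmax_scaled, logged, sep="\n")
```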

### Duplicate values
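    - dupes = Data.duplicated()       # boolean Series, True for each repeated row
    - dupes.sum()                     # number of duplicate rows
    - Data = Data.drop_duplicates()   # keep only the first occurrence of each row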
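
A quick sketch of the duplicate check on a made-up frame (column names and values are illustrative):

```python
import pandas as pd

Data = pd.DataFrame({"id": [1, 2, 2, 3], "val": ["a", "b", "b", "c"]})

dupes = Data.duplicated()       # True for the second (2, "b") row only
n_dupes = int(dupes.sum())      # 1

# drop_duplicates returns a new frame; reset_index renumbers the remaining rows
deduped = Data.drop_duplicates().reset_index(drop=True)
print(n_dupes)
print(deduped)
```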

### Missing values
    - Standard missing values - NaN
        - Data['Col1'].isnull()
        - Data.isnull().values.any()   # Any of the values in the dataframe is a missing value
        - Data.isnull().sum().sum()    # Total number of recognised missing values in the entire dataframe
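        - Data['Number'] = Data['Number'].fillna(12345)   # replace missing values with a static value; assigning back avoids the inplace-on-a-column pattern that newer pandas versions warn about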

    - Non standard missing values - bad data or blank spaces
        - Need to be identified based on the context
    - Replacing missing values
        - Replace with mean
        - Replace with median
        - Replace with static value
        - Self heal the value based on the context
        - Or drop the affected rows instead: Data.dropna(inplace=True)
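
The steps above can be sketched end to end on a made-up frame. The placeholder strings `" "` and `"?"` are assumed examples of non-standard missing values; which strings count as missing depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one NaN and two non-standard placeholders
Data = pd.DataFrame({
    "Number": [10.0, np.nan, 30.0, 50.0],
    "Col1": ["x", " ", "?", "y"],
})

# Turn the assumed placeholders into real missing values so isnull() can see them
placeholders = [" ", "?"]
Data["Col1"] = Data["Col1"].mask(Data["Col1"].isin(placeholders))

missing_per_column = Data.isnull().sum()

# Replace missing numbers with the column median (30.0 here)
Data["Number"] = Data["Number"].fillna(Data["Number"].median())

# Drop the rows that still contain a missing value
cleaned = Data.dropna()
print(missing_per_column)
print(cleaned)
```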
        
## Correcting outliers
1. remove using z-score
    - from scipy import stats
    - import numpy as np
    - z = np.abs(stats.zscore(boston_df))   # get the z-score of every value with respect to their columns
    - print(z)
    - threshold = 3
    - np.where(z > threshold)
    - boston_df1 = boston_df\[(z < threshold).all(axis=1)\]  # Select only the rows without a single outlier
    - boston_df2 = boston_df.copy()   # make a copy of the dataframe
    - Replace all the outliers with median values. This may create some new outliers, but we will ignore them
    - for i, j in zip(*np.where(z > threshold)):   # i iterates over rows, j over columns
    -     boston_df2.iloc\[i,j\] = boston_df.iloc\[:,j\].median()  # loop body: replace the (i,j)th element with the median of column j
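
The z-score steps above, written as a runnable sketch. It uses a small made-up frame instead of `boston_df` and computes the z-scores directly with pandas (population SD, `ddof=0`, which matches the default of `scipy.stats.zscore`), so SciPy is not required.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "a" has one extreme value in the last row
df = pd.DataFrame({
    "a": [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
          10, 9, 11, 10, 12, 10, 9, 11, 10, 100],
    "b": list(range(20)),
}, dtype=float)

threshold = 3
z = ((df - df.mean()) / df.std(ddof=0)).abs()   # |z-score| per value, column by column

# Positions (row, column) of every value beyond the threshold
rows, cols = np.where((z > threshold).to_numpy())

# Option 1: keep only rows without a single outlier
df_removed = df[(z < threshold).all(axis=1)]

# Option 2: replace each outlier with the median of its column
df_replaced = df.copy()
medians = df.median()
for i, j in zip(rows, cols):
    df_replaced.iloc[i, j] = medians.iloc[j]

print(list(zip(rows, cols)))
print(len(df_removed), df_replaced.loc[19, "a"])
```

With only a handful of rows, no value can reach a z-score of 3, so this method needs a reasonably large sample to flag anything.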
    
    
2. remove using IQR
    - Q1 = boston_df.quantile(0.25)
    - Q3 = boston_df.quantile(0.75)
    - IQR = Q3 - Q1
    - print(IQR)
    - np.where((boston_df < (Q1 - 1.5 * IQR)) | (boston_df > (Q3 + 1.5 * IQR)))

    - boston_df_out = boston_df\[~((boston_df < (Q1 - 1.5 * IQR)) |(boston_df > (Q3 + 1.5 * IQR))).any(axis=1)\] # rows without outliers
    - boston_df4 = boston_df.copy()

    - Replace every outlier on the lower side with the lower whisker
    - whisker = Q1 - 1.5 * IQR   # lower whisker for each column
    - for i, j in zip(*np.where(boston_df4 < whisker)):   # i iterates over rows, j over columns
    -     boston_df4.iloc\[i,j\] = whisker.iloc\[j\]   # loop body: whisker is indexed by column name, so use .iloc for position j
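
The IQR steps above as a runnable sketch on a small made-up frame (not `boston_df`). Instead of looping over positions, it uses `DataFrame.clip`, a swapped-in alternative that caps values at both whiskers in one call.

```python
import pandas as pd

# Hypothetical data: "a" has a high outlier, "b" has a low outlier
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 50],
    "b": [-40, 11, 12, 13, 14, 15, 16, 17, 18, 19],
}, dtype=float)

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
lower_whisker = Q1 - 1.5 * IQR
upper_whisker = Q3 + 1.5 * IQR

# True wherever a value falls outside the whiskers of its column
is_outlier = (df < lower_whisker) | (df > upper_whisker)

# Option 1: keep only rows without a single outlier
df_out = df[~is_outlier.any(axis=1)]

# Option 2: cap outliers at the whiskers, column by column
df_capped = df.clip(lower=lower_whisker, upper=upper_whisker, axis=1)

print(is_outlier.sum())
print(df_capped.loc[[0, 9]])
```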

## Bivariate analysis
- Numerical vs. Numerical
    1. Scatterplot
    2. Line plot
    3. Heatmap for correlation
    4. Joint plot
- Categorical vs. Numerical
    1. Bar chart
    2. Violin plot
    3. Categorical box plot
    4. Swarm plot
- Two Categorical Variables
    1. Bar chart
    2. Grouped bar chart
    3. Point plot
## Pandas Profiling
- import pandas_profiling   # newer releases of this package are published as ydata-profiling (import ydata_profiling)
- pandas_profiling.ProfileReport(df)
- pandas_profiling.ProfileReport(df).to_file("output.html")