<a href="https://colab.research.google.com/github/sahilaf/Machine-learning/blob/main/Exploratory_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploratory Data Analysis (EDA)

**EDA Steps**

Here's a structured approach to Exploratory Data Analysis:

### Step 1: Understand the Data
- View initial rows and columns: Use `.head()`, `.tail()`, `.shape`.
- Examine data types and non-null values: Use `.info()`.
- Identify columns and their data types.
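
A minimal sketch of this first look, assuming a small made-up DataFrame (the column names `age`, `city`, and `income` are hypothetical stand-ins for your own data):

```python
import pandas as pd

# Hypothetical toy dataset standing in for your own DataFrame.
df = pd.DataFrame({
    "age": [25, 32, 47, None, 51],
    "city": ["Dhaka", "Khulna", "Dhaka", "Sylhet", None],
    "income": [30000.0, 42000.0, 58000.0, 39000.0, 61000.0],
})

print(df.head(3))   # first three rows
print(df.tail(2))   # last two rows
print(df.shape)     # (number of rows, number of columns)
df.info()           # column dtypes and non-null counts
print(df.dtypes)    # data type of each column
```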

### Step 2: Summary Statistics
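- Calculate descriptive statistics: Use `.describe()`. For numerical columns this reports the count, mean, standard deviation, min, max, and quartiles (the 50% value is the median). The mode is not part of this output; compute it separately with `.mode()`.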
- Understand the spread and central tendency of the data.
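
As a rough illustration, the sketch below runs `.describe()` on a hypothetical `income` column and computes the median and mode separately:

```python
import pandas as pd

# Hypothetical numeric column used only for illustration.
df = pd.DataFrame({"income": [30, 42, 58, 39, 61, 42]})

summary = df["income"].describe()   # count, mean, std, min, 25%, 50%, 75%, max
median = df["income"].median()      # same value as summary["50%"]
mode = df["income"].mode()          # a Series, since there can be several modes

print(summary)
print("median:", median)
print("mode:", mode.tolist())
```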

### Step 3: Value Counts
- Check unique values and their frequencies in columns: Use `.value_counts()`.
- Identify potential duplicates or inconsistencies in categorical data.
- Example: `df['column_name'].value_counts()`
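
The sketch below shows how value counts can expose inconsistent labels. The country values are invented for the example:

```python
import pandas as pd

# Hypothetical categorical column with inconsistent spellings and a missing value.
s = pd.Series(["USA", "USA", "usa", "Canada", "USA ", None])

counts = s.value_counts(dropna=False)    # include missing values in the tally
shares = s.value_counts(normalize=True)  # relative frequencies instead of counts

print(counts)
print(shares)
print("distinct non-missing labels:", s.nunique())
```

Here "USA", "usa", and "USA " (with a trailing space) show up as three separate categories, which is a hint that the column needs cleaning.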

### Step 4: Missing Value Analysis
- Identify where data is missing: Use `.isnull()`.
- Calculate the percentage of missing data per column: Use `df.isnull().sum() / len(df) * 100`.
- Understand the extent and location of gaps in the dataset.
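
A small sketch of the missing-value check, assuming a made-up DataFrame with columns `a`, `b`, and `c`:

```python
import pandas as pd

# Hypothetical dataset with gaps in two of its three columns.
df = pd.DataFrame({
    "a": [1, None, 3, None],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

# Percentage of missing values per column.
missing_pct = df.isnull().sum() / len(df) * 100

# Rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_pct)
print(rows_with_missing)
```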

### Step 5: Visualizations
- **Histograms**: Show the distribution of numerical variables. Example: `df['numerical_column'].hist()`
- **Boxplots**: Identify outliers and understand the spread of numerical data. Example: `df.boxplot(column='numerical_column')`
- **Bar plots**: Compare categories. Example: `df['categorical_column'].value_counts().plot(kind='bar')`
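- **Correlation Heatmaps**: Visualize linear relationships between numerical features. Example: `sns.heatmap(df.corr(numeric_only=True), annot=True)` (requires `import seaborn as sns`; `numeric_only=True` skips non-numeric columns, which otherwise raise an error in pandas 2.0 and later)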
- **Scatter Plots**: Explore bivariate relationships between two numerical variables. Example: `df.plot(kind='scatter', x='column1', y='column2')`
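
The plots above can be combined into one figure. This is only a sketch on synthetic data: the columns `height`, `weight`, and `group` are invented, and it uses pandas' built-in matplotlib plotting rather than seaborn to keep dependencies small.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: weight depends linearly on height plus noise.
rng = np.random.default_rng(0)
df = pd.DataFrame({"height": rng.normal(170, 10, 200)})
df["weight"] = 0.9 * df["height"] - 80 + rng.normal(0, 5, 200)
df["group"] = rng.choice(["a", "b", "c"], size=200)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
df["height"].hist(ax=axes[0, 0], bins=20)                        # histogram
df.boxplot(column="weight", ax=axes[0, 1])                       # boxplot
df["group"].value_counts().plot(kind="bar", ax=axes[1, 0])       # bar plot
df.plot(kind="scatter", x="height", y="weight", ax=axes[1, 1])   # scatter plot
fig.tight_layout()

# Correlation matrix of the numeric columns; pass it to sns.heatmap(corr, annot=True)
# if seaborn is installed.
corr = df.corr(numeric_only=True)
print(corr)
plt.close(fig)
```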

### Step 6: Target Variable Exploration
- Analyze the distribution of the target variable.
- Explore how the target variable relates to other features using visualizations and summary statistics.
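
One way to approach this for a binary target is sketched below. The `churned` target and the `monthly_fee` and `plan` features are hypothetical:

```python
import pandas as pd

# Hypothetical dataset with a binary target column "churned".
df = pd.DataFrame({
    "churned": [0, 0, 1, 0, 1, 0, 0, 1],
    "monthly_fee": [20, 25, 60, 30, 55, 22, 28, 65],
    "plan": ["basic", "basic", "pro", "basic", "pro", "basic", "pro", "pro"],
})

# Distribution of the target (class balance).
target_share = df["churned"].value_counts(normalize=True)

# Numerical feature vs. target: compare group means.
fee_by_target = df.groupby("churned")["monthly_fee"].mean()

# Categorical feature vs. target: churn rate per category.
churn_by_plan = df.groupby("plan")["churned"].mean()

print(target_share)
print(fee_by_target)
print(churn_by_plan)
```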

# Data Cleaning


Strategies to handle common data cleaning issues:

### 1. Handle Missing Values

- **Drop missing rows/columns**: If the number of missing values is very small or the column is not essential. Use `df.dropna()`.
- **Impute missing values**:
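    - **Numerical data**: Fill with the mean or median (the median is less sensitive to outliers). `fillna()` returns a new Series, so assign the result back. Example: `df['numerical_column'] = df['numerical_column'].fillna(df['numerical_column'].mean())`
    - **Categorical data**: Fill with the mode. Example: `df['categorical_column'] = df['categorical_column'].fillna(df['categorical_column'].mode()[0])`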
    - **Advanced methods**: Consider using linear regression, KNN, or interpolation for more sophisticated imputation (for future learning).
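
A minimal imputation sketch, assuming hypothetical `age` and `city` columns. The median is used for the numerical column here because of the one extreme value:

```python
import pandas as pd

# Hypothetical dataset with one missing value per column.
df = pd.DataFrame({
    "age": [20.0, None, 40.0, 100.0],
    "city": ["A", None, "A", "B"],
})

# fillna() returns a new Series, so assign the result back to the column.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```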

### 2. Remove Duplicates
- Detect and drop exact duplicate rows: Use `df.drop_duplicates()`.
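
A short sketch on a made-up table, counting duplicates before removing them:

```python
import pandas as pd

# Hypothetical table in which the row (2, "b") appears twice.
df = pd.DataFrame({"id": [1, 2, 2, 3], "value": ["a", "b", "b", "c"]})

n_dupes = df.duplicated().sum()                      # rows that repeat an earlier row
deduped = df.drop_duplicates().reset_index(drop=True)

print("duplicate rows:", n_dupes)
print(deduped)
```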

### 3. Fix Data Types
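- Convert columns to the correct data types (e.g., from object to datetime or numerical) and assign the result back. Example: `df['date_column'] = pd.to_datetime(df['date_column'])` or `df['numerical_column'] = df['numerical_column'].astype(int)`. Note that `.astype(int)` raises an error if the column contains missing values, so handle those first.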
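
When a column contains values that cannot be parsed, one option is to coerce them to missing values and deal with them afterwards. The sketch below assumes hypothetical `date` and `price` columns stored as text:

```python
import pandas as pd

# Hypothetical columns read in as strings, each with one unparseable entry.
df = pd.DataFrame({
    "date": ["2024-01-05", "2024-02-10", "not a date"],
    "price": ["10", "12.5", "n/a"],
})

# errors="coerce" turns unparseable entries into NaT / NaN instead of raising.
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["price"] = pd.to_numeric(df["price"], errors="coerce")

print(df.dtypes)
print(df)
```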

### 4. Handle Inconsistent Categories
- Clean up categorical values to ensure uniformity (e.g., 'USA', 'U.S.A.', and 'United States' should be consistent). This might involve using string manipulation methods.
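
One possible approach is to normalize the strings first and then map the remaining variants to a canonical label. The `mapping` dictionary below is an assumption made for this example:

```python
import pandas as pd

# Hypothetical country column with several spellings of the same value.
s = pd.Series(["USA", "U.S.A.", "United States", " usa ", "Canada"])

# Step 1: trim whitespace, lowercase, and remove periods.
normalized = s.str.strip().str.lower().str.replace(".", "", regex=False)

# Step 2: map each normalized variant to one canonical label.
# Values missing from the mapping become NaN, which makes them easy to spot.
mapping = {"usa": "USA", "united states": "USA", "canada": "Canada"}
cleaned = normalized.map(mapping)

print(cleaned.value_counts())
```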

### 5. Detect and Handle Outliers
- **Detection**: Use visualizations like boxplots, or statistical methods like the Interquartile Range (IQR) or Z-score.
- **Handling**: Remove outliers (if they are errors) or cap them at a certain value. Example (capping using IQR): `Q1 = df['column'].quantile(0.25)`, `Q3 = df['column'].quantile(0.75)`, `IQR = Q3 - Q1`, `upper_bound = Q3 + 1.5 * IQR`, `df['column'] = df['column'].clip(upper=upper_bound)`
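
The IQR rule can be sketched as follows. This version caps both tails, and the 1.5 multiplier is the conventional choice rather than a fixed requirement:

```python
import pandas as pd

# Hypothetical column with one obvious outlier (100).
df = pd.DataFrame({"column": [10, 12, 11, 13, 12, 11, 100]})

q1 = df["column"].quantile(0.25)
q3 = df["column"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Detection: flag values outside the bounds.
is_outlier = (df["column"] < lower_bound) | (df["column"] > upper_bound)

# Handling: cap values at the bounds instead of dropping the rows.
df["column_capped"] = df["column"].clip(lower=lower_bound, upper=upper_bound)

print(df[is_outlier])
print(df)
```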

### 6. Fix Logical or Domain Errors
- Address values that are incorrect based on domain knowledge (e.g., a negative age, or a purchase date before the product was released). This often requires conditional logic to identify and correct these values.
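
A sketch of such conditional checks, assuming hypothetical `age`, `release_date`, and `purchase_date` columns and a plausible age range of 0 to 120:

```python
import pandas as pd

# Hypothetical dataset containing impossible ages and one purchase before release.
df = pd.DataFrame({
    "age": [34, -5, 210, 28],
    "release_date": pd.to_datetime(["2020-01-01"] * 4),
    "purchase_date": pd.to_datetime(
        ["2021-03-01", "2019-06-01", "2022-01-15", "2020-01-01"]
    ),
})

# Flag rows that violate the domain rules.
bad_age = ~df["age"].between(0, 120)
bad_purchase = df["purchase_date"] < df["release_date"]

# One option: replace impossible ages with NaN so they can be imputed later.
df["age"] = df["age"].mask(bad_age)

print(df[bad_purchase])
print(df)
```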

# Data Preprocessing


Data preprocessing is a crucial step to prepare your data for machine learning models.

### 1. Encoding Categorical Variables
- Convert text labels into numerical representations that machine learning algorithms can understand.

#### Methods:
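1. **Label Encoding (Ordinal)**:
   - Good for ordered categorical variables where there is a natural ranking (e.g., "low", "medium", "high"). Assigns a unique integer to each category.
   - Caution: scikit-learn's `LabelEncoder` assigns integers in sorted (alphabetical) order, so "high", "low", "medium" become 0, 1, 2, which does not match the ranking. It is also intended for target labels rather than input features. To preserve the order of a feature, use `OrdinalEncoder` with explicit categories.
   - Example: Encoding 'size' column: `from sklearn.preprocessing import OrdinalEncoder`, `oe = OrdinalEncoder(categories=[['low', 'medium', 'high']])`, `df['size_encoded'] = oe.fit_transform(df[['size']]).ravel()`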

2. **One-Hot Encoding (Nominal)**:
   - For non-ordered categorical variables where there is no inherent ranking (e.g., 'region', 'color'). Creates new binary columns for each category.
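   - Example: Encoding 'color' column: `pd.get_dummies(df['color'])` or `from sklearn.preprocessing import OneHotEncoder`, `ohe = OneHotEncoder()`, `encoded_data = ohe.fit_transform(df[['color']])`. Note that `OneHotEncoder` returns a sparse matrix by default; call `.toarray()` on the result to get a dense array.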
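
Both encodings are sketched below on hypothetical `size` and `color` columns. The ordered case uses `OrdinalEncoder` with an explicit category order, because `LabelEncoder` would sort the labels alphabetically:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: "size" is ordered, "color" is not.
df = pd.DataFrame({
    "size": ["low", "high", "medium", "low"],
    "color": ["red", "blue", "red", "green"],
})

# Ordinal encoding with an explicit ranking: low=0, medium=1, high=2.
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
df["size_encoded"] = encoder.fit_transform(df[["size"]]).ravel()

# One-hot encoding: one binary column per color.
dummies = pd.get_dummies(df["color"], prefix="color")
df = pd.concat([df, dummies], axis=1)

print(df)
```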

### 2. Feature Transformation
- Used to handle skewed data distributions (e.g., right-skewed or left-skewed data) or to meet the assumptions of certain models.
- Common transformations include logarithmic transformation, square root transformation, or Box-Cox transformation.
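- Example (Log transform): `import numpy as np`, `df['skewed_column_log'] = np.log1p(df['skewed_column'])`. `np.log1p` computes log(1 + x), so it handles zeros; plain `np.log` returns `-inf` for 0 and `NaN` for negative values.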
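
A quick sketch of the effect on skewness, using invented right-skewed values that include a zero:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values, including a zero.
s = pd.Series([0, 1, 10, 100, 10000])

log_s = np.log1p(s)         # log(1 + x), defined at x = 0
restored = np.expm1(log_s)  # inverse transform, recovers the original values

print("skew before:", s.skew())
print("skew after: ", log_s.skew())
```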

### 3. Feature Scaling
- Standardize the range of independent features. This is important for algorithms that are sensitive to the scale of the input data (e.g., gradient descent-based algorithms, KNN, SVMs).
- Converts all features to a similar scale, often between 0 and 1 or with a mean of 0 and standard deviation of 1.

#### Methods:
1. **Min-Max Scaling**: Scales features to a fixed range, usually 0 to 1. Example: `from sklearn.preprocessing import MinMaxScaler`, `scaler = MinMaxScaler()`, `df[['column1', 'column2']] = scaler.fit_transform(df[['column1', 'column2']])`


2. **Standardization (Z-score scaling)**: Scales features to have a mean of 0 and a standard deviation of 1. Example: `from sklearn.preprocessing import StandardScaler`, `scaler = StandardScaler()`, `df[['column1', 'column2']] = scaler.fit_transform(df[['column1', 'column2']])`
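
The sketch below applies both scalers to made-up columns. It fits each scaler on the training split only and reuses it on the test split, a common practice for keeping test-set information out of preprocessing; the split itself is an assumption added for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales.
X = pd.DataFrame({
    "column1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "column2": [100.0, 250.0, 300.0, 420.0, 510.0, 640.0, 700.0, 880.0],
})
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Min-max scaling: training data ends up in [0, 1].
minmax = MinMaxScaler().fit(X_train)
X_train_mm = minmax.transform(X_train)
X_test_mm = minmax.transform(X_test)  # may fall slightly outside [0, 1]

# Standardization: training data ends up with mean 0 and std 1 per column.
standard = StandardScaler().fit(X_train)
X_train_std = standard.transform(X_train)
X_test_std = standard.transform(X_test)

print(X_train_mm.min(axis=0), X_train_mm.max(axis=0))
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```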

# Feature Engineering

Creating new features or transforming existing ones to expose useful patterns that ML models can learn from.

## Common techniques:
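- Mathematical combinations (e.g., ratios, products, or differences of existing columns)
- Target-based flags
- Binning (e.g., grouping a continuous age into ranges)
- Time-based features (e.g., month, day of week, or hour extracted from a timestamp)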
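
Three of these techniques are sketched below on a hypothetical orders table. The column names and bin edges are invented, and target-based flags are left out because they depend on the specific prediction task.

```python
import pandas as pd

# Hypothetical orders table.
df = pd.DataFrame({
    "price": [100.0, 250.0, 80.0],
    "quantity": [2, 1, 5],
    "age": [15, 34, 70],
    "order_date": pd.to_datetime(["2024-01-06", "2024-03-15", "2024-07-01"]),
})

# Mathematical combination: total order value.
df["total"] = df["price"] * df["quantity"]

# Binning: group a continuous age into labeled ranges.
df["age_group"] = pd.cut(
    df["age"], bins=[0, 18, 65, 120], labels=["child", "adult", "senior"]
)

# Time-based features: month and a weekend indicator (Saturday=5, Sunday=6).
df["order_month"] = df["order_date"].dt.month
df["is_weekend"] = df["order_date"].dt.dayofweek >= 5

print(df)
```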

# Feature Selection

Selecting the most useful features and removing the rest.

**Why is it important?**
- Reduces noise and overfitting
- Speeds up training
- Can improve accuracy
- Makes model interpretation easier

## Methods:
### **1. Filter methods (Pure Stat)**
- Correlation matrix -> remove one feature from each pair of highly correlated features
- Chi-square test (categorical vs categorical)
- ANOVA F-test (numerical feature vs categorical target)
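
A sketch of two filter methods on synthetic data: dropping one feature from a highly correlated pair, then scoring the rest with the ANOVA F-test via scikit-learn's `f_classif`. The 0.95 correlation threshold and the feature names are assumptions for illustration, and the chi-square test is omitted for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

# Synthetic data: "signal" depends on the target, "noise" does not,
# and "signal_copy" is nearly a duplicate of "signal".
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)
X = pd.DataFrame({
    "signal": y + rng.normal(0, 0.5, n),
    "noise": rng.normal(0, 1, n),
})
X["signal_copy"] = X["signal"] * 2 + rng.normal(0, 0.01, n)

# Correlation filter: look only at the upper triangle so each pair is checked once.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)

# ANOVA F-test: numerical features against a categorical target.
f_scores, p_values = f_classif(X_reduced, y)

print("dropped:", to_drop)
print(dict(zip(X_reduced.columns, f_scores)))
```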

### **2. Embedded methods (Selection built into the model)**
- Lasso Regression -> shrinks some coefficients to exactly 0, which removes those features
- Tree-based models (Random Forest, XGBoost) -> feature importance scores
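
Both embedded approaches are sketched below on a synthetic regression problem from scikit-learn's `make_regression`. The `alpha` value and the dataset settings are arbitrary choices for the example, and a random forest stands in for tree-based models in general.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 features, of which only 3 actually influence the target.
X, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)
X = StandardScaler().fit_transform(X)  # Lasso is sensitive to feature scale

# Lasso: features whose coefficient is shrunk to exactly 0 are effectively dropped.
lasso = Lasso(alpha=5.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_)

# Random forest: importance scores sum to 1; higher means more useful.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_

print("features kept by Lasso:", kept)
print("forest importances:", importances.round(3))
```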

