# Data Preprocessing and Feature Engineering

### [Video Link](https://www.youtube.com/watch?v=3imSHVySLRc&list=PLfP3JxW-T70GR0w3zVzG7tgIFI14FZxaj&index=1) 

### Data Preprocessing: 
Data preprocessing is the process of converting raw data into clean, usable data through techniques such as cleaning, integration, reduction, and transformation.
<br>

### Why is Data Preprocessing Important?
### Data in the real world is dirty. It is often:
- Incomplete
- Noisy
- Inconsistent
- Duplicate
<br>
<br>

### To train an ML algorithm (which learns from examples, like a child), we need to convert this data into quality data. Quality data has:

- Accuracy 
- Completeness
- Consistency 
- Believability
- Interpretability

--- 
# Explanation:
1. **Accuracy**:
   - **Definition**: Accuracy refers to how close the data values are to the true values or the actual state of affairs.
   - **Importance in ML**: In machine learning, accurate data ensures that the model learns from reliable information, leading to more precise predictions or classifications.
   - **Example**: If a dataset records the heights of individuals, accurate data would mean that the recorded heights closely match the actual heights of those individuals.

2. **Completeness**:
   - **Definition**: Completeness refers to the extent to which all required data is present in the dataset without any missing values.
   - **Importance in ML**: Complete data ensures that the model has access to all necessary information for learning patterns and making predictions.
   - **Example**: In a dataset recording customer information, completeness would mean that there are no missing entries for essential fields like name, age, or contact information.

3. **Consistency**:
   - **Definition**: Consistency in data refers to uniformity and coherence across the dataset. It means that data values are presented in a standardized manner without contradictions or discrepancies.
   - **Importance in ML**: In machine learning, consistent data ensures that the patterns and relationships learned by the model are reliable and accurate. Inconsistent data could lead to biased or incorrect conclusions.
   - **Example**: Suppose you have a dataset of customer addresses where some entries use abbreviated state names (e.g., "CA" for California) while others use the full state name ("California"). Achieving consistency would involve standardizing the format so that all addresses use the same representation for states.

4. **Believability**:
   - **Definition**: Believability refers to the trustworthiness and reliability of the data. It involves assessing the credibility of the information based on its source, accuracy, and relevance.
   - **Importance in ML**: Believable data is crucial for building trustworthy machine learning models. If the data used for training is not credible, the resulting model's predictions or insights may be inaccurate or misleading.
   - **Example**: Consider a dataset of medical records collected from reputable hospitals and clinics versus one obtained from unknown sources online. The former would generally be more believable due to the credibility of the institutions involved.

5. **Interpretability**:
   - **Definition**: Interpretability refers to the ease with which data can be understood and analyzed by humans. It involves presenting data in a clear and understandable format, often with proper documentation.
   - **Importance in ML**: In machine learning, interpretability is crucial for stakeholders to understand how models make predictions or classifications. Transparent models and interpretable data help users trust and validate the model's outputs.
   - **Example**: If you're analyzing a complex dataset with numerous variables, ensuring interpretability might involve providing clear descriptions and explanations of each variable's meaning and significance, as well as visualizations that aid in understanding the data's underlying patterns.
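
Two of the properties above, completeness and consistency, are easy to check in code. The following is a minimal sketch using pandas; the `customers` table, its column names, and the `state_map` dictionary are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical customer records, made up for illustration
customers = pd.DataFrame({
    "name": ["Ana", "Ben", "Cara", "Dev"],
    "age": [34, None, 29, 41],
    "state": ["CA", "California", "NY", "New York"],
})

# Completeness: fraction of non-missing values in each column
completeness = customers.notna().mean()

# Consistency: map full state names to one standard representation
state_map = {"California": "CA", "New York": "NY"}
customers["state"] = customers["state"].replace(state_map)

print(completeness)
print(customers["state"].tolist())
```

Here the `age` column is 75% complete, and after the mapping every state uses the two-letter form.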

---

<br>

### Major Steps in Data Preprocessing:
- Data Cleaning
- Data Integration
- Data Reduction
- Data Transformation
- Data Discretization
<br>

### What is Data Cleaning?

- Data Cleaning means filling in missing values, smoothing out noise while identifying outliers, and correcting inconsistencies in the data.

```python
import pandas as pd

# Create a sample DataFrame with missing values and duplicate rows
data = {
    'Name': ['John', 'Alice', 'Bob', 'Alice', 'Charlie', 'David', 'John'],
    'Age': [25, 28, None, 28, 35, 30, 25],
    'Salary': [50000, 60000, 75000, None, 90000, 80000, 50000]
}

# Convert the data into a pandas DataFrame
df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Handle missing values by removing every row that contains one
df_cleaned = df.dropna()

# Display the DataFrame after handling missing values
print("\nDataFrame after handling missing values:")
print(df_cleaned)

# Remove duplicate rows (keeps the first occurrence)
df_cleaned = df_cleaned.drop_duplicates()

# Display the DataFrame after removing duplicates
print("\nDataFrame after removing duplicates:")
print(df_cleaned)
```


```text
Original DataFrame:
      Name   Age   Salary
0     John  25.0  50000.0
1    Alice  28.0  60000.0
2      Bob   NaN  75000.0
3    Alice  28.0      NaN
4  Charlie  35.0  90000.0
5    David  30.0  80000.0
6     John  25.0  50000.0

DataFrame after handling missing values:
      Name   Age   Salary
0     John  25.0  50000.0
1    Alice  28.0  60000.0
4  Charlie  35.0  90000.0
5    David  30.0  80000.0
6     John  25.0  50000.0

DataFrame after removing duplicates:
      Name   Age   Salary
0     John  25.0  50000.0
1    Alice  28.0  60000.0
4  Charlie  35.0  90000.0
5    David  30.0  80000.0
```
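
The example above handles missing values by dropping rows. The definition of data cleaning also mentions filling in missing values and identifying outliers, so here is a rough sketch of both. Median imputation and the 1.5 × IQR rule are choices made for this illustration, not something these notes prescribe, and the `salaries_with_error` series is hypothetical.

```python
import pandas as pd

# Same sample data as above, rebuilt so this block runs on its own
df = pd.DataFrame({
    "Name": ["John", "Alice", "Bob", "Alice", "Charlie", "David", "John"],
    "Age": [25, 28, None, 28, 35, 30, 25],
    "Salary": [50000, 60000, 75000, None, 90000, 80000, 50000],
})

# Fill missing values with the column median instead of dropping the row
df_filled = df.fillna({
    "Age": df["Age"].median(),
    "Salary": df["Salary"].median(),
})
print(df_filled)

# Hypothetical salary column containing one data-entry error
salaries_with_error = pd.Series([50000, 60000, 75000, 90000, 80000, 5000000])

# IQR rule: flag values more than 1.5 * IQR outside the middle 50% of the data
q1 = salaries_with_error.quantile(0.25)
q3 = salaries_with_error.quantile(0.75)
iqr = q3 - q1
is_outlier = (salaries_with_error < q1 - 1.5 * iqr) | (salaries_with_error > q3 + 1.5 * iqr)
outliers = salaries_with_error[is_outlier].tolist()
print(outliers)
```

Imputation keeps all seven rows, which matters when the dataset is small. Whether the median is a sensible fill value depends on the feature.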


### What is Data Integration?
- Data Integration is a technique that merges data from multiple sources into a coherent data store, such as a data warehouse.

![Data Integration Example:](assets/Data_integration.jpg)
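
As a small-scale illustration of integration, the sketch below merges two tables on a shared key with pandas. The `crm` and `billing` tables and their columns are hypothetical stand-ins for two separate source systems.

```python
import pandas as pd

# Two hypothetical sources that describe overlapping sets of customers
crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["John", "Alice", "Bob"],
})
billing = pd.DataFrame({
    "customer_id": [2, 3, 4],
    "total_spent": [120.0, 75.5, 300.0],
})

# An outer join keeps customers found in either source;
# indicator=True adds a _merge column recording where each row came from
integrated = crm.merge(billing, on="customer_id", how="outer", indicator=True)
print(integrated)
```

The `_merge` column makes gaps between the sources visible: customer 1 exists only in `crm` and customer 4 only in `billing`.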



### What is Data Reduction?
- Data Reduction is a technique used to reduce the size of the data, for instance by aggregating, eliminating redundant features, or clustering.

- Principal Component Analysis (PCA), which is covered later in the ML material, is one technique used for data reduction.

![data Reduction!](assets/data_reduction.jpg)
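
To give a feel for how PCA reduces data, here is a minimal sketch that computes it with NumPy's SVD rather than a dedicated library class. The synthetic dataset is an assumption made for illustration: three features, where the third is nearly the sum of the first two and therefore redundant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 samples, 3 features, third feature is almost redundant
x = rng.normal(size=(100, 2))
third = x[:, 0] + x[:, 1] + rng.normal(scale=0.01, size=100)
data = np.column_stack([x, third])

# PCA via SVD: center each feature, then decompose
centered = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# Fraction of total variance captured by each principal component
explained = s**2 / np.sum(s**2)

# Project onto the first two components: 3 features become 2
reduced = centered @ vt[:2].T

print(reduced.shape)
print(explained)
```

Because the third feature carries almost no new information, the first two components capture nearly all of the variance, so dropping the last one loses very little.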

### Difference between **Data Integration** and **Data Reduction**
**Data Integration** and **Data Reduction** are two different concepts in the field of data management and analysis:

1. **Data Integration:**
   - **Definition:** Data integration is the process of combining data from different sources into a unified view. It involves merging data from disparate sources to provide a more comprehensive and cohesive understanding of the information.
   
<br>

2. **Data Reduction:**
   - **Definition:** Data reduction is the process of reducing the volume of data while producing the same or similar analytical results. It involves simplifying complex datasets while retaining the information and characteristics needed for analysis.<br>

   - **Purpose:** The primary purpose of data reduction is to reduce the size or complexity of a dataset without significantly sacrificing its analytical value. This is often done to improve efficiency in data storage, processing, and analysis, especially when dealing with large datasets.
   

In summary, data integration is about combining data from different sources into a unified view, whereas data reduction is about reducing the size or complexity of a dataset while preserving its analytical value.

### What is Data Transformation?

- Data Transformation means data are transformed or consolidated into forms appropriate for ML model training. For example, normalization may be applied so that values are scaled to fall within a smaller range, such as 0.0 to 1.0.

- Aggregation
- Feature type conversion
- Normalization
- Attribute/feature construction
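
The four techniques listed above can be sketched on one small table. The `employees` DataFrame and its columns are hypothetical, and min-max scaling is used here as one common form of normalization.

```python
import pandas as pd

# Hypothetical employee table; note that age is stored as strings
employees = pd.DataFrame({
    "department": ["sales", "sales", "hr", "hr"],
    "age": ["25", "35", "30", "40"],
    "salary": [50000, 90000, 60000, 80000],
    "bonus": [5000, 9000, 3000, 8000],
})

# Feature type conversion: strings to integers
employees["age"] = employees["age"].astype(int)

# Normalization (min-max): rescale salary into the range 0.0 to 1.0
sal = employees["salary"]
employees["salary_scaled"] = (sal - sal.min()) / (sal.max() - sal.min())

# Attribute/feature construction: derive a new feature from existing ones
employees["total_pay"] = employees["salary"] + employees["bonus"]

# Aggregation: summarize total pay per department
avg_pay = employees.groupby("department")["total_pay"].mean()

print(employees)
print(avg_pay)
```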




<img src="assets/data_discretization.jpg">
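
Data Discretization is listed among the major steps but is not defined in these notes. In general it means converting a continuous feature into a small number of labelled intervals (bins). The sketch below uses `pd.cut`; the bin edges and label names are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical continuous feature
ages = pd.Series([5, 17, 18, 34, 35, 64, 65, 80])

# Bin edges and labels chosen for illustration only
bins = [0, 18, 35, 65, 120]
labels = ["child", "young_adult", "adult", "senior"]

# right=False makes each bin include its left edge: [0, 18), [18, 35), ...
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

print(age_group.tolist())
```

With `right=False`, an age of exactly 18 falls into `young_adult` rather than `child`, so the choice of which edge is inclusive is worth stating explicitly.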

<br>

### Prerequisite

<img src="DataPrepossingAndFeatureEngg/assets/data_discretization.jpg">