# Understanding data and preprocessing
## Types of data: structured and unstructured

<u>Structured data</u>

Structured data refers to information that is organized in a fixed format, typically in tables, spreadsheets, or relational databases. It follows a predefined schema with rows and columns, making it easy to store, query, and analyze using database management systems and analytical tools. 

Examples of structured data: 
- Customer transaction records in a retail store  
- Employee databases containing names, salaries, and job titles  
- Financial statements showing revenues and expenses

Structured data is convenient for machine learning because it can be manipulated directly with languages such as Python and SQL and with libraries such as pandas and NumPy.
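As a rough illustration of how tabular data can be filtered and aggregated with pandas, here is a minimal sketch. The `transactions` table, its column names, and its values are hypothetical and are not taken from any dataset in this notebook.

```python
import pandas as pd

# Hypothetical transaction records; names and amounts are illustrative only.
transactions = pd.DataFrame({
    "customer": ["Ana", "Ben", "Ana", "Cara"],
    "category": ["books", "games", "games", "books"],
    "amount": [12.50, 40.00, 25.00, 8.00],
})

# Filter rows with a condition on one column.
large_purchases = transactions[transactions["amount"] > 10]

# Aggregate: total spend per customer.
spend_per_customer = transactions.groupby("customer")["amount"].sum()
print(spend_per_customer)
```

Because every row shares the same columns, both the filter and the aggregation are one-line operations.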

<u>Unstructured data</u>

Unstructured data, on the other hand, lacks a predefined format and does not fit neatly into tables or relational databases. It includes text, images, audio, and video files. Processing unstructured data requires specialized techniques, including natural language processing (NLP) for text, computer vision for images, and speech recognition for audio. 

Examples of unstructured data: 

- Social media posts, emails, and chat messages  
- Medical imaging scans such as X-rays and MRIs  
- Audio recordings and speech transcriptions  

Handling unstructured data often involves feature extraction and transformation techniques to convert the raw data into a structured format suitable for machine learning models.
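One simple form of feature extraction for text is a bag-of-words count matrix, where each document becomes a row of word counts. The sketch below uses only the standard library. The `bag_of_words` helper and the sample `posts` are hypothetical, and real pipelines usually rely on a dedicated NLP library instead.

```python
import re
from collections import Counter


def bag_of_words(documents):
    """Turn raw text documents into a vocabulary and a matrix of word counts."""
    # Lowercase each document and split it into simple word tokens.
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]
    # The vocabulary is the sorted set of all distinct tokens.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    # Each row counts how often every vocabulary word occurs in one document.
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


posts = ["Great product, great price", "Terrible product"]
vocabulary, matrix = bag_of_words(posts)
print(vocabulary)
print(matrix)
```

The resulting rows have a fixed set of columns (one per vocabulary word), which is the structured shape most machine learning models expect.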

## Handling missing data

In real-world datasets, missing values are a common issue and must be handled carefully to prevent bias or inaccurate predictions. There are several techniques to deal with missing data, depending on the context and severity of the problem.

In [None]:
# 1. Removing missing data
import pandas as pd
df = pd.read_csv('data.csv')
df_cleaned = df.dropna()    # remove rows with missing values

# 2. Imputation (filling missing values)
# Instead of removing missing data, we can fill in the gaps with meaningful values. 
# This can be done using:  
# - Mean or median imputation for numerical data  
# - Mode (most frequent value) for categorical data  
# - Forward fill (using previous values) or backward fill (using next values) for time-series data
df['Age'] = df['Age'].fillna(df['Age'].mean())  # fill missing ages with the mean (assign back instead of chained inplace=True)

# 3. Using Machine Learning Models for Imputation: 
# In some cases, missing values can be predicted using machine learning models trained on existing data. 
# Algorithms such as k-nearest neighbors (KNN) or regression models can estimate missing values based on known patterns.
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=3)
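numeric_cols = df.select_dtypes(include='number').columns  # KNNImputer only accepts numeric data
df_imputed = df.copy()
df_imputed[numeric_cols] = imputer.fit_transform(df[numeric_cols])  # keeps the original column names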
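The comments in the cell above mention mode imputation and forward or backward fill without showing them. A minimal sketch of both follows. The `readings` table and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor log with gaps; values are illustrative only.
readings = pd.DataFrame({
    "temperature": [20.5, np.nan, np.nan, 22.0, np.nan],
    "city": ["Oslo", None, "Oslo", "Bergen", "Oslo"],
})

# Forward fill: carry the last known value forward in time.
forward_filled = readings["temperature"].ffill()

# Backward fill: use the next known value; a trailing gap stays missing.
backward_filled = readings["temperature"].bfill()

# Mode imputation: replace missing categories with the most frequent one.
most_frequent_city = readings["city"].mode()[0]
city_filled = readings["city"].fillna(most_frequent_city)

print(forward_filled.tolist())
print(backward_filled.tolist())
print(city_filled.tolist())
```

Note that backward fill cannot fill the final row because no later value exists, so the two fill directions are sometimes combined.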

## Data cleaning and transformation

Once missing values are handled, further data cleaning and transformation steps are necessary to prepare the dataset for machine learning models.


In [None]:
# Removing duplicates
df = df.drop_duplicates()  # assign the result back; without this the deduplicated frame is discarded

# Handling Outliers 
# Outliers are extreme values that deviate significantly from the dataset's overall distribution. 
# Outliers can skew model predictions and should be treated appropriately.
import numpy as np
z_scores = np.abs((df['Salary']-df['Salary'].mean()) / df['Salary'].std())
df_no_outliers = df[z_scores < 3]  # Keeping values within 3 standard deviations

# Encoding Categorical Data
# Machine learning algorithms require numerical input, so categorical data must be converted into numerical format.
# Two common encoding techniques are:
# - One-hot encoding (for nominal categorical variables, where no order exists)
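df = pd.get_dummies(df, columns=['Gender'])  # Converts 'Male' and 'Female' into binary columns
# - Ordinal (label) encoding (for ordinal categorical variables, where order matters)
# LabelEncoder is meant for target labels and assigns codes alphabetically,
# so the category order is passed explicitly to OrdinalEncoder instead.
from sklearn.preprocessing import OrdinalEncoder
education_order = ['High School', "Bachelor's", "Master's"]  # assumed ranking; list every level present in your data
encoder = OrdinalEncoder(categories=[education_order])
df['Education_Level'] = encoder.fit_transform(df[['Education_Level']]).ravel()  # 'High School' -> 0.0, "Bachelor's" -> 1.0, ...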

# Feature Scaling
# Feature scaling puts numerical variables on a comparable scale, preventing features with larger values from dominating the learning process.
# Two common techniques are:
# - Standardization (z-score normalization)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_standardized = scaler.fit_transform(df.select_dtypes(include='number'))  # scale numeric columns only

# - Min-Max Scaling (rescaling values between 0 and 1)
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
df_min_max = min_max_scaler.fit_transform(df.select_dtypes(include='number'))  # scale numeric columns only
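To make the two scaling techniques concrete, the sketch below computes them by hand with NumPy on a small, hypothetical `salaries` array. Standardization here uses the population standard deviation (NumPy's default), which is also what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical salary values; illustrative only.
salaries = np.array([50000.0, 60000.0, 70000.0, 80000.0, 90000.0])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (salaries - salaries.mean()) / salaries.std()

# Min-max scaling: map the smallest value to 0 and the largest to 1.
min_max_scaled = (salaries - salaries.min()) / (salaries.max() - salaries.min())

print(standardized.round(3))
print(min_max_scaled)
```

After standardization the values have mean 0 and standard deviation 1, while min-max scaling keeps the original spacing but compresses it into the interval from 0 to 1.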



## Example: Filling missing values in a dataset using Pandas

Let's consider an example where we load a dataset, identify missing values, and use imputation techniques to fill in the missing data.

In [9]:
import pandas as pd
import numpy as np

# Creating a sample dataset with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 30, np.nan, 40],
    'Salary': [50000, 60000, np.nan, 80000, 90000]
}

df = pd.DataFrame(data)
print("Original Dataset:")
print(df)

# Filling missing values with the mean of respective columns
# (assigning the result back avoids the chained inplace=True call that pandas deprecates)
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

print("\nDataset after filling missing values:")
print(df)

Original Dataset:
      Name   Age   Salary
0    Alice  25.0  50000.0
1      Bob   NaN  60000.0
2  Charlie  30.0      NaN
3    David   NaN  80000.0
4      Eve  40.0  90000.0

Dataset after filling missing values:
      Name        Age   Salary
0    Alice  25.000000  50000.0
1      Bob  31.666667  60000.0
2  Charlie  30.000000  70000.0
3    David  31.666667  80000.0
4      Eve  40.000000  90000.0


Note: pandas emits a warning for the chained form `df[col].fillna(value, inplace=True)`. The behavior will change in pandas 3.0, where this call will never modify `df` because the intermediate object always behaves as a copy. Use `df[col] = df[col].fillna(value)` or `df.fillna({col: value}, inplace=True)` instead to update the original object.
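The introduction to this example mentions identifying missing values before filling them. A minimal sketch of that step is shown below, rebuilding the same sample data. It then fills the gaps with the median rather than the mean, as one possible alternative, by passing a dictionary of per-column values to `DataFrame.fillna`. The names `missing_counts`, `fill_values`, and `df_filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Same sample data as in the example above.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [25, np.nan, 30, np.nan, 40],
    "Salary": [50000, 60000, np.nan, 80000, 90000],
})

# Identify missing values: count the NaN entries in each column.
missing_counts = df.isna().sum()
print(missing_counts)

# Fill each column with its own median in a single call.
fill_values = {"Age": df["Age"].median(), "Salary": df["Salary"].median()}
df_filled = df.fillna(fill_values)
print(df_filled)
```

The median is less sensitive to extreme values than the mean, which can matter for skewed columns such as salaries.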
