In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, chi2

# Load the data
data = pd.read_csv('data.csv')

# Drop irrelevant columns
data.drop(['ID', 'Name'], axis=1, inplace=True)

# Check for missing values
missing_values = data.isnull().sum()
print(missing_values)

# Impute missing values in the Age column with the mean age
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
data[['Age']] = imputer.fit_transform(data[['Age']])

# Check for missing values again
missing_values = data.isnull().sum()
print(missing_values)

# Convert categorical variable to numerical
label_encoder = LabelEncoder()
data['Gender'] = label_encoder.fit_transform(data['Gender'])

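# One-hot encode categorical variables
onehot_encoder = OneHotEncoder(handle_unknown='ignore')
categorical_features = ['Country']
# sparse_threshold=0 forces a dense result; verbose_feature_names_out=False keeps plain column names
transformer = ColumnTransformer(transformers=[('onehot', onehot_encoder, categorical_features)], remainder='passthrough',
                                sparse_threshold=0, verbose_feature_names_out=False)
# fit_transform returns a NumPy array, so rebuild a DataFrame for the name-based steps that follow
encoded = transformer.fit_transform(data)
data = pd.DataFrame(encoded, columns=transformer.get_feature_names_out(), index=data.index)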

# Scaling numerical features
scaler = StandardScaler()
numerical_features = ['Age', 'Salary']
data[numerical_features] = scaler.fit_transform(data[numerical_features])

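# Feature selection
features = data.drop(['Purchased'], axis=1)
target = data['Purchased']
# chi2 rejects negative inputs, and the standardized columns contain them,
# so the scores are computed on a copy rescaled to the [0, 1] range.
from sklearn.preprocessing import MinMaxScaler
nonnegative_features = MinMaxScaler().fit_transform(features)
# k='all' keeps every feature; pass a smaller integer to keep only the top-scoring ones
selector = SelectKBest(score_func=chi2, k='all')
selector.fit(nonnegative_features, target)
# Keep the selected columns (with their standardized values) as a DataFrame
selected_features = features.loc[:, selector.get_support()]
print(selected_features)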

# Save preprocessed data
preprocessed_data = pd.DataFrame(data=selected_features, columns=features.columns[selector.get_support()])
preprocessed_data['Purchased'] = target
preprocessed_data.to_csv('preprocessed_data.csv', index=False)


In this example, I start by loading the data from a CSV file called 'data.csv' and dropping the irrelevant 'ID' and 'Name' columns. I then count the missing values per column and find that the 'Age' column has gaps, which I fill with the mean age using a SimpleImputer. Counting the missing values a second time confirms that none remain.
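
As a minimal sketch of the imputation step, the snippet below runs the same SimpleImputer call on a small made-up frame. The `toy` DataFrame and its values are assumptions for illustration, not rows from 'data.csv'.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for the real CSV
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0],
    'Salary': [30000.0, 45000.0, 60000.0],
})

# Count missing values per column before imputing
missing_before = toy.isnull().sum()

# Replace NaN in 'Age' with the mean of the observed ages
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
toy[['Age']] = imputer.fit_transform(toy[['Age']])

# Count again to confirm the gap is filled
missing_after = toy.isnull().sum()
print(missing_before['Age'], missing_after['Age'])
print(toy['Age'].tolist())
```

The missing age is replaced by the mean of the two observed ages, and the second count for 'Age' drops to zero.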

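I then convert the categorical variable 'Gender' to numerical using a LabelEncoder, which maps each category to an integer code. I one-hot encode the categorical variable 'Country' using a OneHotEncoder inside a ColumnTransformer, so that each country gets its own indicator column while the remaining columns pass through unchanged. Finally, I scale the numerical features 'Age' and 'Salary' using a StandardScaler.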

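Next, I perform feature selection using SelectKBest and the chi-squared statistical test, which only accepts non-negative feature values. Because k is set to 'all', every feature is kept and this step only scores the features; a smaller integer k would keep just the top-scoring ones. I save the preprocessed data to a new CSV file called 'preprocessed_data.csv'.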
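
The following sketch shows how SelectKBest behaves with chi2 on a tiny made-up array. The array `X`, the labels `y`, and the choice of k=2 are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical non-negative features: the first two columns track the
# label, the third has the same total in both classes
X = np.array([
    [1, 0, 5],
    [1, 0, 4],
    [0, 1, 5],
    [0, 1, 4],
])
y = np.array([1, 1, 0, 0])

# k='all' scores every feature but drops none
keep_all = SelectKBest(score_func=chi2, k='all').fit(X, y)

# k=2 keeps only the two highest-scoring features
keep_two = SelectKBest(score_func=chi2, k=2).fit(X, y)

print(keep_all.scores_)
print(keep_two.get_support())

# chi2 raises a ValueError when any feature value is negative
try:
    chi2(X - 3, y)
    negative_rejected = False
except ValueError:
    negative_rejected = True
print(negative_rejected)
```

The third column scores zero because its per-class totals match what independence from the label would predict, so k=2 discards it.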

Each preprocessing step can change how well the machine learning model performs, so it's important to evaluate its effect, for example by comparing validation scores with and without the step, before keeping it.
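
One possible way to run such a comparison is sketched below. Everything in it is an assumption made for illustration: the synthetic data from `make_classification`, the k-nearest-neighbors model, and the 5-fold cross-validation are not part of the workflow above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X[:, 0] *= 1000  # put one feature on a much larger scale than the rest

# Mean cross-validated accuracy without the scaling step
without_scaling = cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean()

# Same model with StandardScaler fitted inside each training fold
scaled_model = make_pipeline(StandardScaler(), KNeighborsClassifier())
with_scaling = cross_val_score(scaled_model, X, y, cv=5).mean()

print(f"without scaling: {without_scaling:.3f}")
print(f"with scaling:    {with_scaling:.3f}")
```

Wrapping the scaler in a pipeline means it is refitted on the training portion of every fold, so the comparison is not distorted by statistics computed on the held-out rows.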