
# Scikit-learn Introduction

1. scikit-learn is primarily designed for machine learning rather than pure statistical analysis.
2. scikit-learn and sklearn refer to the same library.
3. scikit-learn is the official name of the popular Python machine learning library, and the name of the package on PyPI.
4. sklearn is the name of the Python package you import in code (import sklearn). It is the package's actual import name, not an alias.
5. So, while you install it using pip install scikit-learn, you use import sklearn in your code.
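
As a quick check of the install name versus the import name described above, the following sketch imports the library and prints its version. It assumes scikit-learn is already installed in the environment.

```python
import sklearn  # installed with: pip install scikit-learn

# The import name is "sklearn" even though the PyPI package is "scikit-learn".
print(sklearn.__name__)

# The version string depends on what is installed in your environment.
print(sklearn.__version__)
```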





In [1]:
# pip install scikit-learn

# Impute using SimpleImputer

1. The SimpleImputer class in scikit-learn (from the sklearn.impute module) fills missing values using one of four strategies: mean, median, most_frequent (the mode), or constant.



In [2]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Sample dataset with missing values (NaN)
data = np.array([[7, 2, np.nan],
                 [4, np.nan, 6],
                 [np.nan, 3, 8],
                 [5, 6, 9]])

# Convert to DataFrame for better visualization
df = pd.DataFrame(data, columns=['Feature1', 'Feature2', 'Feature3'])
# print("Original Data:\n", df)

# Create an imputer that fills missing values with the column mean.
# This only initializes the SimpleImputer object with the chosen strategy; no imputation happens yet.
imputer = SimpleImputer(strategy='mean')

# fit_transform() does the work of fit() followed by transform() in a single step:
# 1. Computes the mean (or other chosen statistic) of each column, ignoring the missing values.
# 2. Replaces the missing values in each column with that column's statistic.

# Impute every column of the DataFrame. The result is a new NumPy ndarray; df itself is not modified.
data_imputed = imputer.fit_transform(df)

# Alternatively, impute a specific column and write it back into the same DataFrame:
#df.iloc[:, 0:1] = imputer.fit_transform(df.iloc[:,0:1])
# (or)
#df['Feature2'] = imputer.fit_transform(df[['Feature2']])
#print(df)

# Convert back to DataFrame
imputed_df = pd.DataFrame(data_imputed, columns=df.columns)
print("\nData After Imputation:\n", imputed_df)





Data After Imputation:
    Feature1  Feature2  Feature3
0  7.000000  2.000000  7.666667
1  4.000000  3.666667  6.000000
2  5.333333  3.000000  8.000000
3  5.000000  6.000000  9.000000
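
The cell above uses the mean strategy with fit_transform(). As a minimal sketch of the other options, the example below fits a median imputer first and then reuses it on a new row, and also shows the constant strategy. The new_rows array and the fill value of 0 are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same sample values as above.
train = np.array([[7, 2, np.nan],
                  [4, np.nan, 6],
                  [np.nan, 3, 8],
                  [5, 6, 9]])

# Hypothetical unseen row where every value is missing.
new_rows = np.array([[np.nan, np.nan, np.nan]])

# fit() learns one statistic per column; transform() applies it later.
median_imputer = SimpleImputer(strategy='median')
median_imputer.fit(train)
print("Learned medians:", median_imputer.statistics_)  # [5. 3. 8.]

# The new row is filled with the medians learned from `train`.
filled_new = median_imputer.transform(new_rows)
print("Filled new row:", filled_new)

# The constant strategy replaces every missing value with fill_value.
constant_imputer = SimpleImputer(strategy='constant', fill_value=0)
filled_const = constant_imputer.fit_transform(train)
print("Constant fill:\n", filled_const)
```

Separating fit() from transform() matters once there is a training set and a test set: the statistics are learned from the training data and then applied unchanged to any later data.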


# Categorical to Numerical data conversion

In [3]:
# Label Encoding Technique

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample dataset
data = pd.DataFrame({'State': ['California', 'Texas', 'Georgia', 'Texas', 'California']})

label = LabelEncoder()

# For LabelEncoder, fit_transform() performs two steps in one:
# - fit(): learns the unique categories in the column (in sorted order) and assigns each an integer label.
# - transform(): converts the categorical values into their corresponding integer labels.
# Note: LabelEncoder is intended for target labels (y); the integers imply an order that a feature column may not have.

data['State'] = label.fit_transform(data['State'])
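
To see which integer each category received, the sketch below inspects the fitted encoder and converts the codes back to the original strings. It also shows OneHotEncoder as one possible alternative for feature columns with no natural order. The variable names are illustrative and not part of the cell above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

states = pd.DataFrame({'State': ['California', 'Texas', 'Georgia', 'Texas', 'California']})

encoder = LabelEncoder()
codes = encoder.fit_transform(states['State'])

# classes_ holds the learned categories; a category's position is its integer label.
print(encoder.classes_)  # ['California' 'Georgia' 'Texas']
print(codes)             # [0 2 1 2 0]

# inverse_transform() maps the integers back to the original strings.
print(encoder.inverse_transform(codes))

# One-hot encoding creates one 0/1 column per category instead of a single integer column.
# OneHotEncoder expects 2D input, hence the double brackets.
one_hot = OneHotEncoder().fit_transform(states[['State']]).toarray()
print(one_hot)
```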

# Scale the features and split the data into training and test sets

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Sample data
X = [[25, 50000], [30, 60000], [35, 70000], [40, 80000], [45, 90000]]
y = [0, 0, 0, 1, 1]  # 0 = not purchased, 1 = purchased

# Perform feature scaling so that all features have a similar magnitude (zero mean, unit variance).
# Note: the scaler is fitted on all rows here for simplicity. In a real project, fit it on the
# training split only so that the test data does not influence the scaling statistics.
scaler = StandardScaler()
X = scaler.fit_transform(X)

# random_state makes the split reproducible: the same rows land in the train and test sets on every run.
# Any non-negative integer works and the value has no special meaning; 42 is just a common convention among ML programmers.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("X_train:", X_train)
print("y_train:", y_train)
print("X_test:", X_test)
print("y_test:", y_test)

X_train: [[ 1.41421356  1.41421356]
 [ 0.          0.        ]
 [-1.41421356 -1.41421356]
 [ 0.70710678  0.70710678]]
y_train: [1, 0, 0, 1]
X_test: [[-0.70710678 -0.70710678]]
y_test: [0]
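
The cell above scales all rows before splitting. A common alternative, sketched here on the same sample data, is to split first and fit the scaler on the training rows only, so that the test rows play no part in the computed mean and standard deviation. The variable names with the _raw and _scaled suffixes are illustrative.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_raw = [[25, 50000], [30, 60000], [35, 70000], [40, 80000], [45, 90000]]
y = [0, 0, 0, 1, 1]  # 0 = not purchased, 1 = purchased

# Split the raw (unscaled) data first.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.2, random_state=42)

# Learn the mean and standard deviation from the training rows only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_raw)

# ...and reuse those same statistics for the test rows (transform, not fit_transform).
X_test_scaled = scaler.transform(X_test_raw)

print("Training means after scaling:", X_train_scaled.mean(axis=0))
print("Scaled test rows:", X_test_scaled)
```

The scaled values differ slightly from the output above because the statistics now come from four rows instead of five.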


# Create a model using the Logistic Regression algorithm

In [5]:
from sklearn.linear_model import LogisticRegression
# Creating and training the Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)
print(y_pred)

[0]
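
predict() returns only the class label. As a small sketch, predict_proba() shows the estimated probability behind each label. For brevity this example fits on all five sample rows without a train/test split, so it illustrates the API and is not an evaluation.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X = [[25, 50000], [30, 60000], [35, 70000], [40, 80000], [45, 90000]]
y = [0, 0, 0, 1, 1]  # 0 = not purchased, 1 = purchased

X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression().fit(X_scaled, y)

# classes_ gives the column order used by predict_proba().
print(clf.classes_)

# One row per sample, one column per class; each row sums to 1.
proba = clf.predict_proba(X_scaled)
print(proba.round(3))

# predict() picks the class with the higher probability.
print(clf.predict(X_scaled))
```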


# Evaluation metrics (compare y_test (actual) with y_pred (predicted))

In [6]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Accuracy is the fraction of predictions that match y_test; 1.0 means 100% correct.
print("Accuracy:", accuracy_score(y_test, y_pred))
# Rows are the actual classes and columns are the predicted classes, both in the order given by labels=[1, 0].
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred, labels=[1,0]))


Accuracy: 1.0
Confusion Matrix:
 [[0 0]
 [0 1]]
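
With a single test row, the metrics above cannot show much. The sketch below uses made-up actual and predicted labels (y_true and y_hat are hypothetical, not produced by the model above) to show how the confusion matrix is laid out and how the related scores are computed.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, precision_score, recall_score)

# Hypothetical labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat  = [1, 0, 1, 0, 0, 1, 0, 1]

# With labels=[1, 0] the layout is:
# [[true positives,  false negatives],
#  [false positives, true negatives]]
cm = confusion_matrix(y_true, y_hat, labels=[1, 0])
print(cm)  # [[3 1]
           #  [1 3]]

print("Accuracy:", accuracy_score(y_true, y_hat))    # 6 of 8 correct = 0.75
print("Precision:", precision_score(y_true, y_hat))  # TP / (TP + FP) = 3 / 4
print("Recall:", recall_score(y_true, y_hat))        # TP / (TP + FN) = 3 / 4

# classification_report() summarizes precision, recall and F1 per class.
print(classification_report(y_true, y_hat))
```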


# Pickle the model object into a file for reuse

In [8]:
import pickle

# Save the model as a .pkl file
with open('purchase_classifier.pkl', 'wb') as file:
    pickle.dump(model, file)

# Load the model when needed
#with open('purchase_classifier.pkl', 'rb') as file:
#    loaded_model = pickle.load(file)
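
The model above was trained on scaled features, so any new input must be scaled the same way before calling predict(). One option, sketched here under the assumption that a Pipeline suits the use case, is to pickle the scaler and the model together. The file name purchase_pipeline.pkl and the new_customer values are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = [[25, 50000], [30, 60000], [35, 70000], [40, 80000], [45, 90000]]
y = [0, 0, 0, 1, 1]  # 0 = not purchased, 1 = purchased

# The pipeline scales the input and then applies logistic regression.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X, y)

# Write to a temporary directory so the example leaves no file in the working directory.
path = os.path.join(tempfile.mkdtemp(), 'purchase_pipeline.pkl')
with open(path, 'wb') as file:
    pickle.dump(pipeline, file)

# Load it back and predict directly on raw, unscaled values.
with open(path, 'rb') as file:
    restored = pickle.load(file)

new_customer = [[42, 85000]]  # hypothetical raw input: age, salary
print(restored.predict(new_customer))
```

pickle.load() can execute arbitrary code while loading, so only unpickle files from a source you trust. A pickled model is also best loaded with the same scikit-learn version that saved it.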