# Pandas with Machine Learning
### **Note Before You Start**

Before diving into this notebook, it is important to have a basic understanding of machine learning concepts. Below are some fundamental topics that will help you follow along more effectively:

1. **Supervised vs. Unsupervised Learning**: 
   - Understand the difference between supervised learning (where we have labeled data) and unsupervised learning (where we don't have labels).

2. **Data Splitting (Train/Test Sets)**: 
   - Know why we split data into training and test sets, and the concept of model validation.

3. **Feature Engineering**:
   - Learn how to select, create, and transform features that will help improve your model's performance.

4. **Data Preprocessing**:
   - Concepts such as handling missing values, encoding categorical variables, and scaling numerical features are essential steps in preparing your data for machine learning.

5. **Overfitting and Underfitting**:
   - Understand the concepts of overfitting (when a model learns noise in the data) and underfitting (when a model is too simple to capture the underlying pattern).

6. **Imbalanced Datasets**:
   - Learn how imbalanced datasets can affect model performance and how to handle them with techniques like oversampling, undersampling, or using balanced metrics.

7. **Model Evaluation Metrics**:
   - Familiarize yourself with different metrics for evaluating model performance, such as accuracy, precision, recall, F1-score, and ROC-AUC for classification, and MSE or R-squared for regression.

8. **Scikit-learn**:
   - Have some familiarity with the `scikit-learn` library, as it will be used extensively for machine learning tasks like preprocessing, model training, and evaluation.

Having a good grasp of these topics will help you better understand the code and the concepts explained throughout this notebook. If you are new to any of these topics, consider reviewing some foundational machine learning resources before proceeding.
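
As a quick refresher on the classification metrics listed above, the following sketch computes them with `sklearn.metrics` on a tiny hand-made example. The label vectors `y_true` and `y_pred` are made up for illustration and do not come from any dataset used in this notebook.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical ground-truth labels and model predictions (illustration only)
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Fraction of all predictions that are correct
accuracy = accuracy_score(y_true, y_pred)
# Of the samples predicted as class 1, the fraction that really are class 1
precision = precision_score(y_true, y_pred)
# Of the samples that really are class 1, the fraction the model found
recall = recall_score(y_true, y_pred)
# Harmonic mean of precision and recall
f1 = f1_score(y_true, y_pred)

print(accuracy, precision, recall, f1)
```

With two true positives, one false positive, one false negative, and two true negatives, all four metrics happen to equal 2/3 here. On imbalanced data they usually diverge, which is why accuracy alone can be misleading.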


### 1. Preprocessing Data for Machine Learning
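Most Scikit-learn estimators expect numeric input without missing values, so the data has to be cleaned and preprocessed first. The cell below fills missing values, encodes the `Purchased` column as 0/1, and splits the data into training and test sets.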

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
# Sample dataset
df_ml = pd.DataFrame({
    'Age': [25, 35, 45, 50, None],
    'Salary': [50000, 60000, None, 120000, 80000],
    'Purchased': ['No', 'Yes', 'No', 'Yes', 'No']
})

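# Handling missing values
# Assign the result back to the column: calling fillna(..., inplace=True) on a
# selected column is deprecated and will stop working in pandas 3.0
df_ml['Age'] = df_ml['Age'].fillna(df_ml['Age'].mean())
df_ml['Salary'] = df_ml['Salary'].fillna(df_ml['Salary'].median())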

# Encoding categorical variables
df_ml['Purchased'] = df_ml['Purchased'].map({'No': 0, 'Yes': 1})

# Splitting data into train and test sets
X = df_ml[['Age', 'Salary']]
y = df_ml['Purchased']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train)


     Age    Salary
4  38.75   80000.0
2  45.00   70000.0
0  25.00   50000.0
3  50.00  120000.0
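
The cell above fills missing values using statistics computed on the full dataset, so information from the test rows leaks into the training data. A common alternative is to split first and compute the fill values from the training rows only. The sketch below shows one way to do that; the extra sixth row and the `fill_values` name are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small illustrative dataset (one extra row added so the split has more to work with)
df = pd.DataFrame({
    "Age": [25, 35, 45, 50, None, 30],
    "Salary": [50000, 60000, None, 120000, 80000, 55000],
    "Purchased": [0, 1, 0, 1, 0, 1],
})

X = df[["Age", "Salary"]]
y = df["Purchased"]

# Split BEFORE imputing so the test rows do not influence the fill values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Compute the statistics on the training rows only
fill_values = {
    "Age": X_train["Age"].mean(),
    "Salary": X_train["Salary"].median(),
}

# Apply the same training-derived values to both sets
X_train = X_train.fillna(fill_values)
X_test = X_test.fillna(fill_values)

print(X_train)
print(X_test)
```

Scikit-learn's `SimpleImputer` does the same thing with `fit` on the training set and `transform` on both, which makes it easy to place inside a pipeline.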




### 2. Feature Engineering with Pandas
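Creating new features or transforming existing ones can help improve the performance of machine learning models. The cell below derives the day name from the `Date` column and computes each row's share of total sales.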

In [3]:
# Sample data
df_engineering = pd.DataFrame({
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago'],
    'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']),
    'Sales': [200, 300, 250, 220, 270]
})

# Creating new features from existing columns
df_engineering['DayOfWeek'] = df_engineering['Date'].dt.day_name()
df_engineering['Sales_Ratio'] = df_engineering['Sales'] / df_engineering['Sales'].sum()

print(df_engineering)

          City       Date  Sales  DayOfWeek  Sales_Ratio
0     New York 2023-01-01    200     Sunday     0.161290
1  Los Angeles 2023-01-02    300     Monday     0.241935
2      Chicago 2023-01-03    250    Tuesday     0.201613
3     New York 2023-01-04    220  Wednesday     0.177419
4      Chicago 2023-01-05    270   Thursday     0.217742
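
Two other feature types are often derived the same way: numeric or boolean date features, and group-level aggregates broadcast back to each row. The sketch below reuses the sample data; the column names `DayOfWeekNum`, `IsWeekend`, and `City_Mean_Sales` are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["New York", "Los Angeles", "Chicago", "New York", "Chicago"],
    "Date": pd.to_datetime(
        ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05"]
    ),
    "Sales": [200, 300, 250, 220, 270],
})

# Numeric day of week: Monday=0, ..., Sunday=6
df["DayOfWeekNum"] = df["Date"].dt.dayofweek

# Boolean flag for Saturday/Sunday
df["IsWeekend"] = df["DayOfWeekNum"] >= 5

# Mean sales per city, repeated on every row of that city
df["City_Mean_Sales"] = df.groupby("City")["Sales"].transform("mean")

print(df)
```

`transform("mean")` returns a result aligned with the original index, so it can be assigned directly as a new column without a merge.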


### 3. Integrating Pandas with Scikit-learn (train_test_split(), get_dummies())
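Pandas works well alongside Scikit-learn: `pd.get_dummies()` one-hot encodes categorical columns, and Scikit-learn's `train_test_split()` accepts DataFrames directly. Two caveats about this toy example: `Sales` is a continuous value, so `LogisticRegression` (a classifier) treats each distinct sales figure as its own class, and `Sales_Ratio` is derived from `Sales`, so keeping it in `X` leaks the target into the features. The perfect training score should therefore not be read as evidence of a good model.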

In [4]:
from sklearn.linear_model import LogisticRegression

# Encoding categorical variables using get_dummies
df_ml_dummies = pd.get_dummies(df_engineering, columns=['City', 'DayOfWeek'], drop_first=True)

# Splitting data for Scikit-learn
X = df_ml_dummies.drop(['Sales', 'Date'], axis=1)
y = df_ml_dummies['Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

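# Training a simple model (illustration only: LogisticRegression is a classifier,
# so each distinct Sales value is treated as a separate class)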
model = LogisticRegression()
model.fit(X_train, y_train)

print("Training score:", model.score(X_train, y_train))

Training score: 1.0
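
Because `Sales` is a continuous quantity, a regressor is the more natural model for it. The following is a minimal sketch, assuming `LinearRegression` as the model and `City` as the only feature, of how the same `get_dummies()` encoding feeds a regression model. It is a swapped-in alternative for illustration, not the model used in the cell above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "City": ["New York", "Los Angeles", "Chicago", "New York", "Chicago"],
    "Sales": [200, 300, 250, 220, 270],
})

# One-hot encode City; dtype=float gives numeric 0.0/1.0 columns
X = pd.get_dummies(df[["City"]], drop_first=True, dtype=float)
y = df["Sales"]

# Fit on all rows (the dataset is too small for a meaningful split)
model = LinearRegression()
model.fit(X, y)

# With only one-hot city features, each prediction is that city's mean sales
pred = model.predict(X)
print(pred)
```

With `drop_first=True`, the dropped category (`Chicago`) becomes the baseline captured by the intercept, and each remaining coefficient is the difference from that baseline.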


### 4. Handling Imbalanced Data
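Imbalanced datasets can bias a model toward the majority class. Techniques like oversampling, undersampling, and SMOTE help balance the data. The example below uses random oversampling with `sklearn.utils.resample`; SMOTE is not part of Scikit-learn and is provided by the separate `imbalanced-learn` package. In practice, resample only the training set so that duplicated rows do not end up in the test set.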

In [5]:
from sklearn.utils import resample

# Sample imbalanced dataset
df_imbalanced = pd.DataFrame({
    'Feature': [1, 2, 3, 4, 5, 6],
    'Class': [0, 0, 0, 0, 1, 1]
})

# Upsample minority class
df_minority = df_imbalanced[df_imbalanced['Class'] == 1]
df_majority = df_imbalanced[df_imbalanced['Class'] == 0]

df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=42)

# Combine the majority class with the upsampled minority class
df_balanced = pd.concat([df_majority, df_minority_upsampled])

print(df_balanced['Class'].value_counts())

Class
0    4
1    4
Name: count, dtype: int64
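
Undersampling works the same way in the opposite direction: the majority class is sampled without replacement down to the size of the minority class. A minimal sketch on the same toy data, with variable names chosen for illustration:

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "Feature": [1, 2, 3, 4, 5, 6],
    "Class": [0, 0, 0, 0, 1, 1],
})

minority = df[df["Class"] == 1]
majority = df[df["Class"] == 0]

# Draw as many majority rows as there are minority rows, without replacement
majority_downsampled = resample(
    majority, replace=False, n_samples=len(minority), random_state=42
)

# Combine the reduced majority class with the untouched minority class
df_under = pd.concat([majority_downsampled, minority])
counts = df_under["Class"].value_counts()

print(counts)
```

Undersampling discards data, so it tends to suit large datasets, while oversampling keeps every row at the cost of duplicates.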


### 5. Scaling and Normalization
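Many models are sensitive to feature scale. Standardization (`StandardScaler`) rescales a feature to zero mean and unit variance, while min-max normalization (`MinMaxScaler`) rescales it to the range [0, 1].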

In [6]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample data for scaling
df_scaling = pd.DataFrame({
    'Age': [20, 30, 40, 50],
    'Income': [20000, 50000, 80000, 100000]
})

# Standardization
scaler = StandardScaler()
df_scaling['Age_Standardized'] = scaler.fit_transform(df_scaling[['Age']])

# Normalization
minmax_scaler = MinMaxScaler()
df_scaling['Income_Normalized'] = minmax_scaler.fit_transform(df_scaling[['Income']])

print(df_scaling)

   Age  Income  Age_Standardized  Income_Normalized
0   20   20000         -1.341641              0.000
1   30   50000         -0.447214              0.375
2   40   80000          0.447214              0.750
3   50  100000          1.341641              1.000
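
To make the two transformations concrete, the sketch below reproduces them with plain pandas arithmetic. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas `Series.std()` defaults to the sample standard deviation (`ddof=1`), so `ddof=0` must be passed explicitly to match.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [20, 30, 40, 50],
    "Income": [20000, 50000, 80000, 100000],
})

# Standardization: z = (x - mean) / std, using the population std (ddof=0)
age_standardized = (df["Age"] - df["Age"].mean()) / df["Age"].std(ddof=0)

# Min-max normalization: (x - min) / (max - min)
income_range = df["Income"].max() - df["Income"].min()
income_normalized = (df["Income"] - df["Income"].min()) / income_range

print(age_standardized.round(6).tolist())
print(income_normalized.tolist())
```

The printed values match the `Age_Standardized` and `Income_Normalized` columns shown above.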


### 6. Working with Text Data for NLP
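Text data needs to be cleaned, tokenized, and vectorized before it can be used for machine learning. `CountVectorizer` handles lowercasing and tokenization itself and, with its default settings, ignores single-character tokens, which is why "a" does not appear among the columns in the output.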

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
df_text = pd.DataFrame({
    'Text': ['This is a sentence.', 'Machine learning is fun!', 'Pandas is great for data analysis.']
})

# Using CountVectorizer to transform text data into feature vectors
vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(df_text['Text'])

# Display the feature vectors as a DataFrame
df_vectorized = pd.DataFrame(X_text.toarray(), columns=vectorizer.get_feature_names_out())

print(df_vectorized)

   analysis  data  for  fun  great  is  learning  machine  pandas  sentence  \
0         0     0    0    0      0   1         0        0       0         1   
1         0     0    0    1      0   1         1        1       0         0   
2         1     1    1    0      1   1         0        0       1         0   

   this  
0     1  
1     0  
2     0  
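
When more control over cleaning is needed, the text can be normalized with pandas string methods before vectorizing. The sketch below lowercases the text, strips everything except letters and whitespace, and splits on whitespace. The regular expression and the `Clean` and `Tokens` column names are assumptions for illustration, and real text (digits, accents, contractions) usually needs a more careful pattern.

```python
import pandas as pd

df = pd.DataFrame({
    "Text": [
        "This is a sentence.",
        "Machine learning is fun!",
        "Pandas is great for data analysis.",
    ]
})

# Lowercase, then drop any character that is not a-z or whitespace
df["Clean"] = (
    df["Text"]
    .str.lower()
    .str.replace(r"[^a-z\s]", "", regex=True)
)

# Whitespace tokenization: each row becomes a list of words
df["Tokens"] = df["Clean"].str.split()

print(df[["Clean", "Tokens"]])
```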


### 7. Feature Selection and Dimensionality Reduction
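Reducing the number of features can help improve model performance and reduce overfitting. The example below uses PCA, which projects the data onto the directions of greatest variance. Because the synthetic data here is uncorrelated random noise, the variance is spread almost evenly across the five features, so the first two components explain only about half of it.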

In [8]:
from sklearn.decomposition import PCA
import numpy as np

# Generating synthetic data (no random seed is set, so the ratios printed below differ between runs)
df_pca = pd.DataFrame(np.random.randn(100, 5), columns=['Feature1', 'Feature2', 'Feature3', 'Feature4', 'Feature5'])

# Applying PCA for dimensionality reduction
pca = PCA(n_components=2)
df_reduced = pca.fit_transform(df_pca)

print("Explained variance ratio:", pca.explained_variance_ratio_)

Explained variance ratio: [0.26031933 0.23235218]
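
`pca.fit_transform` returns a plain NumPy array, so the column labels and index are lost. The sketch below wraps the result back into a DataFrame and seeds the random generator so the run is repeatable. The seed value and the `PC1`/`PC2` column names are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Seeded generator so the synthetic data is the same on every run
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.standard_normal((100, 5)),
    columns=[f"Feature{i}" for i in range(1, 6)],
)

pca = PCA(n_components=2)
components = pca.fit_transform(df)

# Restore a labeled DataFrame that shares the original index
df_reduced = pd.DataFrame(components, columns=["PC1", "PC2"], index=df.index)

print(df_reduced.head())
print("Total explained variance:", pca.explained_variance_ratio_.sum())
```

Keeping the original index makes it straightforward to join the components back onto other columns, such as a target or an identifier.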


### 8. Data Transformation Pipelines
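A pipeline bundles the preprocessing steps and the model into a single estimator, so the transformations fitted on the training data are applied in the same way at prediction time. Note that the example below predicts on the same rows it was trained on, purely to show the API.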

In [9]:
from sklearn.pipeline import Pipeline

# Sample data
df_pipeline = pd.DataFrame({
    'Feature1': [10, 20, 30, 40],
    'Feature2': [100, 200, 300, 400]
})
y_pipeline = [0, 1, 0, 1]

# Creating a pipeline for scaling and model training
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# Training the pipeline
pipeline.fit(df_pipeline, y_pipeline)

# Predicting using the pipeline
predictions = pipeline.predict(df_pipeline)
print("Predictions:", predictions)

Predictions: [0 0 1 1]
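
Real datasets usually mix numeric and categorical columns. One way to handle that inside a pipeline is Scikit-learn's `ColumnTransformer`, which applies a different transformer to each group of columns. The following is a minimal sketch; the data, the labels, and the unseen city `Boston` are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data with one numeric and one categorical column
df = pd.DataFrame({
    "Age": [25, 35, 45, 50, 23, 41],
    "City": ["New York", "Chicago", "Chicago", "New York", "Los Angeles", "Los Angeles"],
})
y = [0, 1, 1, 1, 0, 1]

# Scale the numeric column and one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["City"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])
pipeline.fit(df, y)

# A new row with a city that was not seen during training
new_rows = pd.DataFrame({"Age": [30], "City": ["Boston"]})
pred = pipeline.predict(new_rows)
print("Prediction:", pred)
```

`handle_unknown="ignore"` encodes an unseen category as all zeros instead of raising an error, which `pd.get_dummies()` cannot do because it has no fitted state to reuse at prediction time.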
