Topic 3: **Data Reduction**

Data reduction techniques aim to reduce the size or complexity of a dataset while preserving as much relevant information as possible. This can improve computational efficiency, mitigate the curse of dimensionality, and help machine learning algorithms perform better. Let's explore two common data reduction techniques in detail:

### 1. Dimensionality Reduction

Dimensionality reduction techniques aim to reduce the number of features in a dataset while preserving important information. This is particularly useful when dealing with high-dimensional data, as it can help alleviate issues such as overfitting and improve the interpretability of the model. Common dimensionality reduction techniques include:

#### a. Principal Component Analysis (PCA)

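PCA is a popular technique for reducing the dimensionality of a dataset by transforming the original features into a new set of orthogonal (uncorrelated) variables called principal components. The components are ordered by the variance they capture: the first points along the direction of maximum variance in the data, and each subsequent one captures as much of the remaining variance as possible while staying orthogonal to the earlier ones. Keeping only the first few components therefore reduces the number of features while retaining most of the variance. In the example below, Feature2 is an exact linear function of Feature1, so a single component retains all of the variance.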

In [1]:
from sklearn.decomposition import PCA
import pandas as pd

# Example data with multiple features
data = pd.DataFrame({'Feature1': [1, 2, 3, 4, 5],
                     'Feature2': [5, 4, 3, 2, 1]})

# PCA dimensionality reduction
pca = PCA(n_components=1)
data_reduced = pca.fit_transform(data)

print("Original data:")
print(data)
print("\nReduced data:")
print(data_reduced)

Original data:
   Feature1  Feature2
0         1         5
1         2         4
2         3         3
3         4         2
4         5         1

Reduced data:
[[ 2.82842712]
 [ 1.41421356]
 [-0.        ]
 [-1.41421356]
 [-2.82842712]]
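
To make the transformation less of a black box, here is a minimal NumPy sketch of the computation PCA performs on the same toy data: center the features, take a singular value decomposition, and project onto the first principal direction. It is an illustrative sketch under those assumptions, not scikit-learn's actual implementation, and the variable names (`X_centered`, `explained_variance_ratio`, `X_reduced`) are chosen here for the example.

```python
import numpy as np

# Same toy data as the PCA cell above: Feature2 is exactly 6 - Feature1.
X = np.array([[1, 5], [2, 4], [3, 3], [4, 2], [5, 1]], dtype=float)

# Step 1: center each feature on its mean.
X_centered = X - X.mean(axis=0)

# Step 2: SVD of the centered data; the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Step 3: share of the total variance carried by each component.
explained_variance_ratio = S**2 / np.sum(S**2)

# Step 4: project the centered data onto the first principal direction.
X_reduced = X_centered @ Vt[:1].T

print("Explained variance ratio:", np.round(explained_variance_ratio, 6))
# The sign of a principal direction is arbitrary, so compare magnitudes only.
print("Reduced data (magnitudes):", np.round(np.abs(X_reduced).ravel(), 6))
```

Because the two features are perfectly linearly related, the first component should carry essentially all of the variance, and the magnitudes of the projected values should match the `Reduced data` output above (the signs may be flipped).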


### 2. Sampling

Sampling techniques change the number of rows in a dataset rather than the number of features. Here they are used to address class imbalance, either by undersampling the majority class (which shrinks the dataset) or by oversampling the minority class (which grows it). Balancing the class distribution in this way can improve the performance of machine learning models that would otherwise be dominated by the majority class. Common sampling techniques include:

#### a. Random Undersampling

Random undersampling involves randomly selecting a subset of samples from the majority class to match the size of the minority class.

In [3]:
!pip install imbalanced-learn




In [7]:
from imblearn.under_sampling import RandomUnderSampler
import pandas as pd

# Example data with imbalanced classes
data = pd.DataFrame({'Class': ['A']*10 + ['B']*100})
data

    Class
0       A
1       A
2       A
3       A
4       A
..    ...
105     B
106     B
107     B
108     B


In [8]:
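# Undersampling: randomly drop majority-class rows until both classes have the same count.
# The toy frame has no feature columns, so X is empty here; only the labels drive the resampling.
undersampler = RandomUnderSampler(random_state=42)  # fixed seed for a reproducible selection
X_under, y_under = undersampler.fit_resample(data.drop('Class', axis=1), data['Class'])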

print("Original data class distribution:")
print(data['Class'].value_counts())
print("\nUndersampled data class distribution:")
print(pd.Series(y_under).value_counts())

Original data class distribution:
B    100
A     10
Name: Class, dtype: int64

Undersampled data class distribution:
A    10
B    10
Name: Class, dtype: int64
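
The idea behind random undersampling can also be sketched with the standard library alone. The helper below, `random_undersample`, is a hypothetical name introduced for this illustration and is not part of imblearn; it assumes the labels arrive as a plain list and returns the row indices to keep.

```python
import random
from collections import Counter


def random_undersample(labels, seed=0):
    """Return sorted row indices that keep an equal number of rows per class."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Every class is cut down to the size of the smallest class.
    target = min(len(rows) for rows in by_class.values())

    kept = []
    for rows in by_class.values():
        # Sample without replacement, so no row is selected twice.
        kept.extend(rng.sample(rows, target))
    return sorted(kept)


labels = ['A'] * 10 + ['B'] * 100
kept = random_undersample(labels)
under_counts = Counter(labels[i] for i in kept)
print(under_counts)
```

Sampling without replacement means every kept row is a distinct original row; the cost is that 90 of the 100 majority-class rows are discarded.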




#### b. Random Oversampling

Random oversampling involves randomly duplicating samples from the minority class to match the size of the majority class.

In [6]:
from imblearn.over_sampling import RandomOverSampler
import pandas as pd

# Example data with imbalanced classes
data = pd.DataFrame({'Class': ['A']*10 + ['B']*100})

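# Oversampling: randomly duplicate minority-class rows until both classes have the same count.
oversampler = RandomOverSampler(random_state=42)  # fixed seed for a reproducible selection
X_over, y_over = oversampler.fit_resample(data.drop('Class', axis=1), data['Class'])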

print("Original data class distribution:")
print(data['Class'].value_counts())
print("\nOversampled data class distribution:")
print(pd.Series(y_over).value_counts())

Original data class distribution:
B    100
A     10
Name: Class, dtype: int64

Oversampled data class distribution:
A    100
B    100
Name: Class, dtype: int64
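
As a rough illustration of what "randomly duplicating samples" means, the following standard-library sketch draws extra minority-class rows with replacement until every class reaches the size of the largest one. The helper `random_oversample` is a hypothetical name used only for this example and is not imblearn's implementation.

```python
import random
from collections import Counter


def random_oversample(labels, seed=0):
    """Return row indices in which every class appears as often as the largest class."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Every class is grown to the size of the largest class.
    target = max(len(rows) for rows in by_class.values())

    resampled = []
    for rows in by_class.values():
        resampled.extend(rows)  # keep every original row
        # Draw the missing rows with replacement, which produces duplicates.
        resampled.extend(rng.choices(rows, k=target - len(rows)))
    return resampled


labels = ['A'] * 10 + ['B'] * 100
resampled = random_oversample(labels)
over_counts = Counter(labels[i] for i in resampled)
unique_minority_rows = len({i for i in resampled if labels[i] == 'A'})
print(over_counts)
print("Distinct minority rows:", unique_minority_rows)
```

The class counts are balanced, but the number of distinct minority rows is unchanged: oversampling repeats existing rows and adds no new information.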
