Here’s a guide on data collection and data preprocessing, covering data sources, cleaning, encoding, and scaling.
Data collection is the first step in any data-driven project. The sources of data can be classified into:
- Surveys: Collect responses from users through Google Forms, Typeform, or Microsoft Forms.
- Interviews: Structured or unstructured interviews with individuals or experts.
- Sensors & IoT Devices: Data collected from hardware sensors, such as temperature, humidity, or motion detectors.
- Web Scraping: Extracting data from websites using libraries like BeautifulSoup or Scrapy (a minimal scraping sketch follows this list).
- Kaggle: A vast repository of publicly available datasets for machine learning and analytics.
- data.gov.in: A government portal providing public datasets on demographics, economics, and more.
- Internet Archive: A digital library containing historical data, text, images, and videos.
- UCI Machine Learning Repository: Offers various structured datasets for machine learning research.
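As a rough sketch of web scraping (using the requests library and example.com purely as a placeholder URL; real sites have their own structure and terms of use), BeautifulSoup can parse a page's HTML:
import requests
from bs4 import BeautifulSoup

# Fetch a page and parse its HTML (example.com is a placeholder URL)
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Extract the page title and all hyperlink targets
print(soup.title.get_text())
links = [a.get("href") for a in soup.find_all("a")]
print(links)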
Raw data is often messy and requires extensive preprocessing. The key steps are:
Duplicates can occur due to multiple data entries or data merging. Use pandas to remove them:
import pandas as pd
# Load dataset
df = pd.read_csv("data.csv")
# Remove duplicate rows
df = df.drop_duplicates()
# Reset index after dropping
df.reset_index(drop=True, inplace=True)
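By default, drop_duplicates() treats a row as a duplicate only if every column matches. When records should instead be deduplicated on a key column (the id column below is purely hypothetical), the subset and keep parameters control this:
# Keep the last occurrence for each id (hypothetical key column)
df = df.drop_duplicates(subset=['id'], keep='last').reset_index(drop=True)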
Missing values can negatively impact model performance. There are multiple ways to handle them:
- Drop Rows with Missing Values:
df = df.dropna()
- Mean Imputation (for numerical data):
df.fillna(df.mean(numeric_only=True), inplace=True)
- Mode Imputation (for categorical data):
df['Category_Column'] = df['Category_Column'].fillna(df['Category_Column'].mode()[0])
- Forward Fill (Using previous values):
df.ffill(inplace=True)
- Backward Fill (Using next values):
df.bfill(inplace=True)
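Which strategy fits best usually depends on how much data is missing in each column, which is worth checking first:
# Count missing values per column to guide the choice of strategy
missing_counts = df.isnull().sum()
print(missing_counts[missing_counts > 0])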
Outliers can distort results. A common approach is the IQR (interquartile range) rule: rows with any numeric value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are removed:
Q1 = df.quantile(0.25, numeric_only=True)
Q3 = df.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1
num_cols = IQR.index  # only numeric columns are checked
df = df[~((df[num_cols] < (Q1 - 1.5 * IQR)) | (df[num_cols] > (Q3 + 1.5 * IQR))).any(axis=1)]
Most machine learning models require numerical input, so categorical data must be encoded.
Label encoding converts each category to an integer. Note that LabelEncoder assigns codes alphabetically, so an intended order such as Low < Medium < High is not guaranteed to be preserved (see the ordinal-encoding sketch below):
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['Category'] = encoder.fit_transform(df['Category'])
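If the order genuinely matters, one option is scikit-learn's OrdinalEncoder with an explicitly specified category order; the Category column and the Low/Medium/High levels below are assumptions for illustration:
from sklearn.preprocessing import OrdinalEncoder

# Map the stated order explicitly: Low -> 0, Medium -> 1, High -> 2
ordinal = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
df[['Category']] = ordinal.fit_transform(df[['Category']])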
One-hot encoding is used when categories have no specific order; drop_first=True removes one dummy column to avoid redundant, perfectly correlated features:
df = pd.get_dummies(df, columns=['Category'], drop_first=True)
Feature scaling ensures that all numerical features have the same scale, preventing dominant features from skewing the model.
Min-Max scaling maps values into the range 0 to 1:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['Feature1', 'Feature2']] = scaler.fit_transform(df[['Feature1', 'Feature2']])
Standardization transforms data to have zero mean and unit variance:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['Feature1', 'Feature2']] = scaler.fit_transform(df[['Feature1', 'Feature2']])
Robust scaling uses the median and IQR, making it less sensitive to outliers:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df[['Feature1', 'Feature2']] = scaler.fit_transform(df[['Feature1', 'Feature2']])
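Whichever scaler is used, it should normally be fit on the training split only and then applied to the test split, so that test-set statistics do not leak into preprocessing. A minimal sketch, assuming the hypothetical Feature1, Feature2, and Target columns:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df[['Feature1', 'Feature2']]
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn statistics from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics on test data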
- Collect Data → From surveys, Kaggle, data.gov.in, etc.
- Remove Duplicates → df.drop_duplicates()
- Handle Missing Values → df.fillna() or df.ffill()
- Handle Outliers → Using the IQR method
- Encode Categorical Data → LabelEncoder() or pd.get_dummies()
- Scale Data → Using MinMaxScaler, StandardScaler, or RobustScaler
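Putting the steps together, here is a minimal end-to-end sketch; the file name, column names, and chosen strategies are illustrative assumptions rather than a fixed recipe:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load and deduplicate (data.csv and all column names below are hypothetical)
df = pd.read_csv("data.csv")
df = df.drop_duplicates().reset_index(drop=True)

# Impute missing values: mean for numeric columns, mode for the categorical column
df = df.fillna(df.mean(numeric_only=True))
df['Category'] = df['Category'].fillna(df['Category'].mode()[0])

# Remove outliers on numeric columns using the IQR rule
Q1 = df.quantile(0.25, numeric_only=True)
Q3 = df.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1
num_cols = IQR.index
df = df[~((df[num_cols] < (Q1 - 1.5 * IQR)) | (df[num_cols] > (Q3 + 1.5 * IQR))).any(axis=1)]

# One-hot encode the nominal column and scale the numeric features
df = pd.get_dummies(df, columns=['Category'], drop_first=True)
scaler = StandardScaler()
df[['Feature1', 'Feature2']] = scaler.fit_transform(df[['Feature1', 'Feature2']])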