Guide on Data Collection and Data Preprocessing

File metadata and controls

155 lines (111 loc) · 4.43 KB

This guide covers data collection and data preprocessing: data sources, data cleaning, encoding categorical data, and feature scaling.

1. Data Collection

Data collection is the first step in any data-driven project. The sources of data can be classified into:

A. Primary Data Sources (Self-Collected Data)

  • Surveys: Collect responses from users through Google Forms, Typeform, or Microsoft Forms.
  • Interviews: Structured or unstructured interviews with individuals or experts.
  • Sensors & IoT Devices: Readings from hardware sensors, such as temperature, humidity, or motion sensors.
  • Web Scraping: Extracting data from websites using libraries like BeautifulSoup or Scrapy (see the sketch below).
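
As a quick illustration of the last point, here is a minimal scraping sketch using requests and BeautifulSoup. The URL and the h2 tag are placeholders; always check a site's terms of service and robots.txt before scraping.

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")  # placeholder URL
response.raise_for_status()  # stop early on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every <h2> heading on the page
headings = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(headings)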

B. Secondary Data Sources (Existing Datasets)

  • Kaggle: A vast repository of open datasets for machine learning and analytics.
  • data.gov.in: A government portal providing public datasets on demographics, economics, and more.
  • Internet Archive: A digital library containing historical data, text, images, and videos.
  • UCI Machine Learning Repository: Offers various structured datasets for machine learning research (a loading sketch follows this list).
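
As a quick illustration of using a secondary source, the classic Iris dataset can be read straight from the UCI repository into pandas. The URL below is the repository path at the time of writing; swap in whichever dataset you are actually working with.

import pandas as pd

# The raw Iris file has no header row, so supply the column names
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

df = pd.read_csv(url, header=None, names=columns)
print(df.head())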

2. Data Preprocessing

Raw data is often messy and requires extensive preprocessing. The key steps are:

A. Data Cleaning

1. Removing Duplicates

Duplicates can occur due to multiple data entries or data merging. Use pandas to remove them:

import pandas as pd

# Load dataset
df = pd.read_csv("data.csv")

# Remove duplicate rows
df = df.drop_duplicates()

# Reset index after dropping
df.reset_index(drop=True, inplace=True)

2. Handling Missing Values

Missing values can negatively impact model performance. There are multiple ways to handle them:

a) Removing Rows with Missing Values

df = df.dropna()

b) Filling Missing Values (Imputation)

  • Mean Imputation (for numerical data):
df = df.fillna(df.mean(numeric_only=True))
  • Mode Imputation (for categorical data):
df['Category_Column'] = df['Category_Column'].fillna(df['Category_Column'].mode()[0])
  • Forward Fill (using the previous value):
df = df.ffill()
  • Backward Fill (using the next value):
df = df.bfill()
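
The same strategies are available through scikit-learn's SimpleImputer, which is convenient when imputation needs to live inside a modeling pipeline. A minimal sketch, assuming the placeholder file and column names below:

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("data.csv")  # placeholder file name

# Mean for numeric columns, most-frequent (mode) for categorical ones
num_imputer = SimpleImputer(strategy='mean')
cat_imputer = SimpleImputer(strategy='most_frequent')

df[['Feature1', 'Feature2']] = num_imputer.fit_transform(df[['Feature1', 'Feature2']])
df[['Category_Column']] = cat_imputer.fit_transform(df[['Category_Column']])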

3. Handling Outliers

Outliers can distort model results. A common approach is the IQR (interquartile range) method: values more than 1.5 × IQR below the first quartile or above the third quartile are treated as outliers, and the rows containing them are dropped:

# Compute quartiles on the numeric columns only
num_cols = df.select_dtypes(include='number').columns
Q1 = df[num_cols].quantile(0.25)
Q3 = df[num_cols].quantile(0.75)
IQR = Q3 - Q1

# Keep rows whose numeric values all fall within 1.5 * IQR of the quartiles
df = df[~((df[num_cols] < (Q1 - 1.5 * IQR)) | (df[num_cols] > (Q3 + 1.5 * IQR))).any(axis=1)]

B. Encoding Categorical Data

Most machine learning models require numerical inputs, so categorical data must be encoded.

1. Label Encoding (For Ordinal Data)

Used when categories have a meaningful order (e.g., Low < Medium < High). Note that LabelEncoder assigns integer codes alphabetically, which may not match the intended order; see the sketch after this example for encoding with an explicit order:

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
df['Category'] = encoder.fit_transform(df['Category'])
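
If the category order matters, one option is pandas' ordered categoricals, which let you spell the order out explicitly. A minimal sketch (the 'Priority' column and its levels are illustrative):

import pandas as pd

df = pd.DataFrame({'Priority': ['Low', 'High', 'Medium', 'Low']})

# Define the order explicitly, then take the integer codes
order = ['Low', 'Medium', 'High']
df['Priority'] = pd.Categorical(df['Priority'], categories=order, ordered=True).codes

# df['Priority'] now holds [0, 2, 1, 0], respecting Low < Medium < High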

2. One-Hot Encoding (For Nominal Data)

Used when categories have no specific order. pd.get_dummies creates one binary column per category; drop_first=True drops one of them to avoid perfectly correlated (redundant) columns:

df = pd.get_dummies(df, columns=['Category'], drop_first=True)

C. Feature Scaling

Feature scaling puts all numerical features on a comparable scale, preventing features with large ranges from dominating distance-based models and gradient updates.

1. Min-Max Scaling (Normalization)

Scales values to the [0, 1] range via x' = (x − min) / (max − min):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['Feature1', 'Feature2']] = scaler.fit_transform(df[['Feature1', 'Feature2']])

2. Standardization (Z-score Scaling)

Transforms data to have zero mean and unit variance via z = (x − μ) / σ:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['Feature1', 'Feature2']] = scaler.fit_transform(df[['Feature1', 'Feature2']])

3. Robust Scaling (Handling Outliers)

Centers each feature on its median and scales by the IQR, making it robust to outliers:

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
df[['Feature1', 'Feature2']] = scaler.fit_transform(df[['Feature1', 'Feature2']])

Final Process Overview

  1. Collect Data → From surveys, Kaggle, data.gov.in, etc.
  2. Remove Duplicates → df.drop_duplicates()
  3. Handle Missing Values → df.ffill() or imputation with fillna()
  4. Handle Outliers → Using the IQR method
  5. Encode Categorical Data → LabelEncoder() or pd.get_dummies()
  6. Scale Data → Using MinMaxScaler, StandardScaler, or RobustScaler
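
Putting it all together, here is a minimal end-to-end sketch of the pipeline above. The file name and column names (data.csv, Category, Feature1, Feature2) are placeholders; adapt them to your dataset.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# 1. Collect: load the raw data (placeholder file name)
df = pd.read_csv("data.csv")

# 2. Remove duplicate rows
df = df.drop_duplicates().reset_index(drop=True)

# 3. Handle missing values: forward fill, then drop any remaining gaps
df = df.ffill().dropna()

# 4. Handle outliers on the numeric columns with the IQR rule
num_cols = df.select_dtypes(include='number').columns
Q1 = df[num_cols].quantile(0.25)
Q3 = df[num_cols].quantile(0.75)
IQR = Q3 - Q1
df = df[~((df[num_cols] < (Q1 - 1.5 * IQR)) | (df[num_cols] > (Q3 + 1.5 * IQR))).any(axis=1)]

# 5. Encode the nominal column (placeholder column name)
df = pd.get_dummies(df, columns=['Category'], drop_first=True)

# 6. Scale the numeric features (placeholder column names)
scaler = StandardScaler()
df[['Feature1', 'Feature2']] = scaler.fit_transform(df[['Feature1', 'Feature2']])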