# Module: Data Collection and Cleaning


Data is the foundation of any data science project. This module focuses on acquiring data from various sources and preparing it for analysis. High-quality, clean data is essential for accurate insights and reliable models.
## Submodule 1: Types of Data
Structured data is organized into well-defined rows and columns, like a spreadsheet. Unstructured data lacks a predefined structure and can include text, images, audio, and video.
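
To make the distinction concrete, the sketch below holds the same kind of information in two forms: structured records with named fields, and an unstructured sentence. The field names and sample text are invented for illustration.

```python
import re

# Structured: every record has the same named fields, so access is direct.
structured_rows = [
    {"customer": "Ana", "rating": 5},
    {"customer": "Ben", "rating": 2},
]
average_rating = sum(row["rating"] for row in structured_rows) / len(structured_rows)

# Unstructured: the same kind of information buried in free text.
review_text = "Ana gave it 5 stars, but Ben only gave it 2 stars."

# Pulling values out of free text needs a parsing step (here, a regex).
ratings_from_text = [int(n) for n in re.findall(r"(\d+) stars", review_text)]

print(average_rating)
print(ratings_from_text)
```

The structured form can be aggregated immediately, while the free text has to be parsed before it can be analyzed.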

## Submodule 2: Data Sources and Collection Methods
Data can be collected through various channels, such as APIs, web scraping, and database queries. For example, let's collect data from a public API:



In [None]:
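import requests

# API endpoint (OpenWeatherMap current weather)
url = "https://api.openweathermap.org/data/2.5/weather"
params = {"q": "New York", "appid": "your_api_key"}  # replace with your own key

# Send a GET request; the timeout keeps the call from hanging indefinitely
response = requests.get(url, params=params, timeout=10)
response.raise_for_status()  # raise an error for 4xx/5xx responses

# Parse the JSON body into a Python dictionary
data = response.json()

print(data)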
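
Databases are another source named above. As a minimal sketch, the example below queries an in-memory SQLite database with Python's built-in `sqlite3` module. The `readings` table and its rows are hypothetical stand-ins for a real database.

```python
import sqlite3

# Build a throwaway in-memory database (hypothetical table and values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT, temp_c REAL)")
conn.executemany(
    "INSERT INTO readings (city, temp_c) VALUES (?, ?)",
    [("New York", 21.5), ("Boston", 18.0), ("Chicago", 16.5)],
)
conn.commit()

# Collect rows with a parameterized query; placeholders avoid SQL injection.
min_temp = 17.0
rows = conn.execute(
    "SELECT city, temp_c FROM readings WHERE temp_c > ? ORDER BY city",
    (min_temp,),
).fetchall()
conn.close()

print(rows)
```

Against a real database, only the connection step would change; the query-and-fetch pattern stays the same.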


## Submodule 3: Data Acquisition and Import
Before you can analyze data, you need to load it into your analysis environment. Libraries like Pandas make this straightforward. Let's import data from a CSV file:

In [None]:
import pandas as pd

# Read CSV file
data = pd.read_csv("data.csv")

# Display the first few rows
print(data.head())
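
Real CSV files often mark missing values with custom text or store dates as plain strings. The sketch below shows how `pd.read_csv` options can handle both at import time. The CSV content, the column names, and the `unknown` marker are assumptions made up for this example.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path.
csv_text = """name,age,signup_date
Ana,34,2023-01-15
Ben,unknown,2023-02-01
Cleo,29,
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    na_values=["unknown"],        # treat this marker as a missing value
    parse_dates=["signup_date"],  # parse this column as datetimes
)

# Inspect the inferred types and the number of missing values per column
print(df.dtypes)
print(df.isna().sum())
```

Checking `dtypes` and missing-value counts right after import catches many problems before they reach the analysis.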


## Submodule 4: Data Cleaning and Preprocessing
Cleaning data means handling missing values, outliers, and duplicates. Let's fill the missing values in a column of a Pandas DataFrame:

Example: Handling Missing Values

In [None]:
# Replace missing values with the column mean.
# Assigning the result back is more reliable than inplace=True on a selected column.
data["age"] = data["age"].fillna(data["age"].mean())
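
Duplicates and outliers, the other two issues mentioned above, can be handled in a similar way. The following is a minimal sketch on a made-up DataFrame. The 1.5 × IQR rule used to flag outliers is one common convention, not the only option.

```python
import pandas as pd

# Hypothetical data: one exact duplicate row and one implausible age.
df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Ben", "Cleo", "Dev", "Eli"],
        "age": [34, 29, 29, 41, 38, 430],
    }
)

# Drop exact duplicate rows, keeping the first occurrence.
df = df.drop_duplicates().reset_index(drop=True)

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["age_is_outlier"] = ~df["age"].between(lower, upper)

print(df)
```

Flagging outliers instead of deleting them right away leaves room to decide later whether they are errors or genuine extreme values.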


## Submodule 5: Data Transformation and Feature Engineering
Data transformation converts raw columns into a form suited to analysis. Feature engineering creates new features from existing ones. Let's derive a new feature from two existing columns:

Example: Feature Engineering

In [None]:
# Create a new feature 'total_sales'
data["total_sales"] = data["product_price"] * data["quantity_sold"]
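
Two other common patterns are extracting parts of a date and bucketing a numeric column. The sketch below illustrates both. The `order_date` column, the sample values, and the price band boundaries are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical sales records.
df = pd.DataFrame(
    {
        "order_date": ["2023-01-15", "2023-06-03", "2023-12-24"],
        "product_price": [10.0, 250.0, 40.0],
    }
)

# Transformation: convert text dates to datetimes, then derive the month.
df["order_date"] = pd.to_datetime(df["order_date"])
df["order_month"] = df["order_date"].dt.month

# Feature engineering: bucket a numeric column into labeled price bands.
df["price_band"] = pd.cut(
    df["product_price"],
    bins=[0, 50, 200, float("inf")],
    labels=["low", "mid", "high"],
)

print(df)
```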


## Submodule 6: Dealing with Messy Data
Address inconsistencies and errors in data using regular expressions. Let's clean a column containing phone numbers:

Example: Cleaning Phone Numbers

In [None]:
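import re

def clean_phone_number(phone):
    # Leave missing or non-string values untouched
    if not isinstance(phone, str):
        return phone
    # Remove every character that is not a digit
    return re.sub(r"[^0-9]", "", phone)

data["phone_number"] = data["phone_number"].apply(clean_phone_number)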
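
Stripping non-digits is often only the first step, because the result can still vary in length. As a sketch, the hypothetical helper below goes one step further and normalizes to a single format. It assumes 10-digit US-style numbers with an optional leading country code of 1, which may not match your data.

```python
import re

def normalize_us_phone(raw):
    """Return a 10-digit number as 'XXX-XXX-XXXX', or None if it does not fit."""
    if not isinstance(raw, str):
        return None  # missing or non-text values cannot be parsed
    digits = re.sub(r"\D", "", raw)
    # Drop a leading country code of 1, if present.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

samples = ["(555) 123-4567", "+1 555.123.4567", "12345", None]
normalized = [normalize_us_phone(s) for s in samples]
print(normalized)
```

Returning `None` for values that cannot be normalized makes the failures easy to count and review afterwards.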


## Submodule 7: Data Quality Assurance
Implement validation checks to ensure data quality. Let's validate email addresses:

Example: Validating Email Addresses

In [None]:
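import re

def is_valid_email(email):
    # Non-string values (for example, missing entries) are treated as invalid
    if not isinstance(email, str):
        return False
    # Simplified format check; it does not confirm that the address exists
    pattern = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
    return re.match(pattern, email) is not None

data["is_valid_email"] = data["email"].apply(is_valid_email)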
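
Validation usually covers more than one rule. One possible way to organize several checks is to compute a failure mask per rule and summarize the counts, as sketched below. The columns, sample values, and the 0 to 120 age range are illustrative assumptions.

```python
import pandas as pd

# Hypothetical records with a few deliberate problems.
df = pd.DataFrame(
    {
        "customer_id": [1, 2, 2, 4],
        "age": [34, -5, 29, 41],
        "email": ["ana@example.com", "ben@example.com", None, "dev@example.com"],
    }
)

# Each check yields a boolean Series marking the rows that FAIL it.
failures = {
    "duplicate_id": df["customer_id"].duplicated(keep=False),
    "age_out_of_range": ~df["age"].between(0, 120),
    "missing_email": df["email"].isna(),
}

# Summarize how many rows fail each check.
report = {name: int(mask.sum()) for name, mask in failures.items()}
print(report)
```

Keeping the masks around makes it easy to inspect the failing rows, for example with `df[failures["age_out_of_range"]]`.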


## Submodule 8: Case Study and Practical Exercises
Work on case studies to apply data collection and cleaning techniques to real-world scenarios. For example, you might clean and preprocess a dataset of customer reviews, identifying and resolving issues like duplicate entries and missing values.
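
A small version of the customer reviews scenario might look like the sketch below. The dataset, its column names, and the decision to fill missing ratings with the median are all assumptions for illustration. A real case study would justify each choice against the data.

```python
import pandas as pd

# Hypothetical customer reviews with a duplicate entry and missing values.
reviews = pd.DataFrame(
    {
        "review_id": [1, 2, 2, 3, 4],
        "rating": [5, 3, 3, None, 4],
        "text": ["Great!", "  ok  ", "  ok  ", "Late delivery", None],
    }
)

cleaned = (
    reviews
    .drop_duplicates(subset="review_id")            # one row per review
    .dropna(subset=["text"])                        # drop reviews with no text
    .assign(text=lambda d: d["text"].str.strip())   # trim stray whitespace
    .reset_index(drop=True)
)

# Fill missing ratings with the median of the remaining rows (one possible choice).
cleaned["rating"] = cleaned["rating"].fillna(cleaned["rating"].median())

print(cleaned)
```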

Benefits:

- Learners gain hands-on experience in collecting, cleaning, and preprocessing data using real-world examples.
- They develop practical skills to handle messy data and ensure data quality.
- By practicing with code examples, learners are better prepared to tackle data challenges in their own projects.

Key Takeaways:

The "Data Collection and Cleaning" module equips learners with essential skills to acquire, clean, and prepare data for analysis. These skills are foundational for successful data science projects.