# Module 2, Lesson 1: Understanding Data Types

## Setup

In [None]:
import pandas as pd
import numpy as np
import sys

print("Python version:", sys.version.split()[0])  # e.g. "3.11.4"; slicing [:5] would truncate "3.10.12"
print("Pandas version:", pd.__version__)
print("NumPy version:", np.__version__)

---

## Part 1: Python's Basic Data Types
**The Data Type Foundation**

### Example 1: Same Value, Different Types
This example shows how the same value behaves completely differently depending on its data type. We'll store the number 42 three different ways and see what happens when we try to use them.

In [None]:
# The same number stored different ways behaves differently
number_as_int = 42
number_as_float = 42.0
number_as_string = "42"

print("Integer:", number_as_int, "→ Type:", type(number_as_int))
print("Float:  ", number_as_float, "→ Type:", type(number_as_float))
print("String: ", number_as_string, "→ Type:", type(number_as_string))
print()

# Why does this matter? Let's try to do math:
print("Integer + Integer:", number_as_int + number_as_int)
print("Float + Float:", number_as_float + number_as_float)
print("String + String:", number_as_string + number_as_string)  # Concatenates to "4242", not 84
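
When a number arrives as text, converting it explicitly restores numeric behavior. The following is a minimal sketch using Python's built-in `int()` and `float()`; the names `as_int` and `as_float` are illustrative only.

```python
# Convert text to numbers explicitly before doing math
number_as_string = "42"

as_int = int(number_as_string)      # "42" becomes the integer 42
as_float = float(number_as_string)  # "42" becomes the float 42.0

print("Converted int + int:    ", as_int + as_int)
print("Converted float + float:", as_float + as_float)
print("Unconverted str + str:  ", number_as_string + number_as_string)
```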

### Example 2: Real-World Type Error
This demonstrates a common mistake where sales data is accidentally stored as text instead of numbers. This happens often when importing data from spreadsheets or web forms.

In [None]:
# The danger of wrong types - a real-world scenario
sales_good = 100
sales_bad = "100"

# This works
total_good = sales_good * 3
print(f"Correct: 100 * 3 = {total_good}")

# This runs without an error, but it repeats the text three times instead of multiplying
total_bad = sales_bad * 3
print(f"Wrong type: '100' * 3 = {total_bad}")
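
One way to guard against this is to convert incoming values before using them and to decide up front what happens when a value is not a number. The sketch below assumes a hypothetical helper, `parse_sales`, and made-up input values; it is one possible approach, not a standard function.

```python
def parse_sales(value):
    """Hypothetical helper: return value as a float, or None if it cannot be converted."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # TypeError covers None; ValueError covers text such as "N/A"
        return None

# Made-up inputs of the kind a spreadsheet or web form might produce
raw_values = [100, "100", " 250.5 ", "N/A", None]
parsed = [parse_sales(v) for v in raw_values]
print("Parsed values:", parsed)

# Only the values that converted cleanly take part in the total
valid_total = sum(v for v in parsed if v is not None)
print("Total of valid values:", valid_total)
```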

---

## Part 2: Structured vs Unstructured Data
**Two Fundamental Categories**

### Example 3: Structured Data
Structured data fits neatly into rows and columns. This example shows customer data in a DataFrame where we can instantly calculate statistics and find insights.

In [None]:
# STRUCTURED DATA - Easy to analyze
customers = pd.DataFrame({
    'customer_id': [1001, 1002, 1003, 1004, 1005],
    'name': ['Alice Brown', 'Bob Smith', 'Charlie Lee', 'Diana Ross', 'Eve Wilson'],
    'age': [25, 32, 28, 45, 38],
    'spending': [1200.50, 850.00, 2100.75, 950.25, 1850.00]
})

print("STRUCTURED DATA (DataFrame):")
print(customers)
print()

# Instant analysis possible:
print(f"Average age: {customers['age'].mean():.1f} years")
print(f"Total revenue: ${customers['spending'].sum():,.2f}")
print(f"Biggest spender: {customers.loc[customers['spending'].idxmax(), 'name']}")
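
Because every column holds one kind of value, filtering and sorting are also one-liners. A short sketch on the same customer figures (the `name` column is left out for brevity, and the variable names `over_30` and `top_spenders` are illustrative):

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1001, 1002, 1003, 1004, 1005],
    'age': [25, 32, 28, 45, 38],
    'spending': [1200.50, 850.00, 2100.75, 950.25, 1850.00]
})

# Keep only the rows where age is above 30
over_30 = customers[customers['age'] > 30]
print("Customers over 30:")
print(over_30)

# Sort by spending, highest first, and keep the top two rows
top_spenders = customers.sort_values('spending', ascending=False).head(2)
print("\nTop two spenders:")
print(top_spenders)
```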

### Example 4: Unstructured Data
Unstructured data like customer reviews contains valuable information but can't be analyzed directly with standard mathematical operations. The text first has to be turned into numbers, which is the job of dedicated techniques such as text mining and natural language processing (NLP).

In [None]:
# UNSTRUCTURED DATA - Requires processing
customer_reviews = [
    "This product is amazing! Best purchase ever.",
    "Terrible quality. Broke after 2 days.",
    "Good value for money. Would recommend.",
]

print("UNSTRUCTURED DATA (Text):")
for i, review in enumerate(customer_reviews, 1):
    print(f"Review {i}: {review}")
print()

# Can't directly calculate average sentiment or extract insights
# Would need text processing, sentiment analysis, etc.
print("Notice: Can't easily calculate statistics from text!")
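
To make the "requires processing" step concrete, the sketch below turns each review into a few numeric features. The keyword sets and the `count_matches` helper are assumptions invented for this illustration; counting keywords is a deliberately naive stand-in and not real sentiment analysis.

```python
import pandas as pd

customer_reviews = [
    "This product is amazing! Best purchase ever.",
    "Terrible quality. Broke after 2 days.",
    "Good value for money. Would recommend.",
]

# Hypothetical keyword lists, chosen only for this illustration
positive_words = {"amazing", "best", "good", "recommend"}
negative_words = {"terrible", "broke"}

def count_matches(text, vocabulary):
    """Count words in text (lowercased, punctuation stripped) that appear in vocabulary."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in vocabulary for w in words)

# Each review becomes one row of numbers that pandas can summarize
features = pd.DataFrame({
    'review': customer_reviews,
    'word_count': [len(r.split()) for r in customer_reviews],
    'positive_hits': [count_matches(r, positive_words) for r in customer_reviews],
    'negative_hits': [count_matches(r, negative_words) for r in customer_reviews],
})
print(features[['word_count', 'positive_hits', 'negative_hits']])
```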

### Example 5: Semi-Structured Data
JSON data from APIs is semi-structured - it has organization but isn't immediately ready for analysis. This example shows how we extract the parts we need into a structured format.

In [None]:
# SEMI-STRUCTURED DATA - In between
order_json = {
    "order_id": 12345,
    "customer": {
        "name": "John Doe",
        "email": "john@example.com"
    },
    "items": [
        {"product": "Laptop", "price": 999.99},
        {"product": "Mouse", "price": 29.99}
    ]
}

print("SEMI-STRUCTURED DATA (JSON):")
print(order_json)
print()

# Can extract structured parts:
items_df = pd.DataFrame(order_json['items'])
print("Extracted items as structured data:")
print(items_df)
print(f"Order total: ${items_df['price'].sum():.2f}")
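
When the nested fields are needed alongside the items, `pd.json_normalize` can flatten the whole record in one call. This sketch reuses the same order; `record_path` names the list to expand into rows, and `meta` names the outer fields to copy onto each row.

```python
import pandas as pd

order_json = {
    "order_id": 12345,
    "customer": {
        "name": "John Doe",
        "email": "john@example.com"
    },
    "items": [
        {"product": "Laptop", "price": 999.99},
        {"product": "Mouse", "price": 29.99}
    ]
}

# One row per item, with the order id and customer name repeated on each row
flat = pd.json_normalize(
    order_json,
    record_path='items',
    meta=['order_id', ['customer', 'name']]
)
print(flat)
print("Columns:", flat.columns.tolist())
```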

---

## Part 3: Data Types in Pandas
**How Pandas Represents Different Data**

### Example 6: Pandas Data Types Overview
This example creates a realistic employee dataset showing all the main pandas data types you'll work with: integers, floats, strings, dates, booleans, and categories.

In [None]:
# Create a DataFrame with various data types
df = pd.DataFrame({
    'employee_id': [1001, 1002, 1003, 1004],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'salary': [75000.50, 82000.00, 68500.75, 91000.00],
    'start_date': pd.to_datetime(['2022-01-15', '2021-06-01', '2023-03-20', '2020-11-10']),
    'is_active': [True, True, False, True],
    'department': pd.Categorical(['Sales', 'IT', 'Sales', 'HR'])
})

print("DataFrame with multiple data types:")
print(df)
print("\nData types for each column:")
print(df.dtypes)
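
Once the types are set correctly, they can be used to pick columns. A brief sketch with `DataFrame.select_dtypes` on the same employee data (the `*_cols` variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'employee_id': [1001, 1002, 1003, 1004],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'salary': [75000.50, 82000.00, 68500.75, 91000.00],
    'start_date': pd.to_datetime(['2022-01-15', '2021-06-01', '2023-03-20', '2020-11-10']),
    'is_active': [True, True, False, True],
    'department': pd.Categorical(['Sales', 'IT', 'Sales', 'HR'])
})

# Select column names by the kind of data they hold
numeric_cols = df.select_dtypes(include='number').columns.tolist()
datetime_cols = df.select_dtypes(include='datetime').columns.tolist()
category_cols = df.select_dtypes(include='category').columns.tolist()
bool_cols = df.select_dtypes(include='bool').columns.tolist()

print("Numeric: ", numeric_cols)
print("Datetime:", datetime_cols)
print("Category:", category_cols)
print("Boolean: ", bool_cols)
```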

### Example 7: Memory Efficiency with Categories
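The `category` type is designed for text that repeats: pandas stores each distinct value once and represents every row with a small integer code. This example compares the memory used by a category column with the same column stored as strings. With only four rows the difference is small, but the savings grow as the same few values repeat across more rows, which matters when working with large datasets.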

In [None]:
# Memory usage comparison
print("Memory usage by data type:")
print(df.memory_usage(deep=True))
print()

# Convert department from category back to string to see difference
df_inefficient = df.copy()
df_inefficient['department'] = df_inefficient['department'].astype(str)

print(f"Category memory: {df['department'].memory_usage(deep=True)} bytes")
print(f"String memory: {df_inefficient['department'].memory_usage(deep=True)} bytes")
print(f"Savings: {df_inefficient['department'].memory_usage(deep=True) - df['department'].memory_usage(deep=True)} bytes")
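
The effect is easier to see at scale. The sketch below builds a made-up column of 300,000 rows containing only three distinct department names and compares the two representations; exact byte counts depend on the pandas version and string storage, so none are quoted here.

```python
import pandas as pd

# Made-up data: three department names repeated 100,000 times each
departments = ['Sales', 'IT', 'HR']
as_text = pd.Series(departments * 100_000)
as_category = as_text.astype('category')

text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"Rows: {len(as_text):,}")
print(f"Text memory:     {text_bytes:,} bytes")
print(f"Category memory: {category_bytes:,} bytes")
print(f"Category uses {category_bytes / text_bytes:.1%} of the text memory")
```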

---

## Part 4: Common Data Type Problems
**What Happens When Types Are Wrong**

### Example 8: Numeric Data Stored as Strings
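This is one of the most common data problems: numbers that look right but are actually text. This example shows what goes wrong when you calculate with such a column. Multiplying a text column by integers typically does not raise an error; it repeats each string, so the mistake is easy to miss.

In [None]:
# Problem 1: Numbers stored as strings
bad_data = pd.DataFrame({
    'product': ['Laptop', 'Phone', 'Tablet'],
    'price': ['999.99', '699.99', '499.99'],  # Stored as strings!
    'quantity': [5, 10, 7]
})

print("Data with price as string:")
print(bad_data)
print(bad_data.dtypes)
print()

# Try to calculate revenue
try:
    revenue = bad_data['price'] * bad_data['quantity']
    # With object-dtype strings this does not raise: each price is repeated `quantity` times
    print("No error raised, but the result is repeated text, not revenue:")
    print(revenue)
except TypeError as e:
    # Some pandas versions or string dtypes may refuse the operation instead
    print(f"ERROR: {e}")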
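
Since the problem can pass silently, it helps to check column types before calculating. A minimal sketch using `pandas.api.types.is_numeric_dtype` on the same data (the `convertible` variable name is illustrative):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

bad_data = pd.DataFrame({
    'product': ['Laptop', 'Phone', 'Tablet'],
    'price': ['999.99', '699.99', '499.99'],  # Stored as strings
    'quantity': [5, 10, 7]
})

# Report which columns pandas treats as numeric
for column in bad_data.columns:
    print(f"{column}: numeric = {is_numeric_dtype(bad_data[column])}")

# Count how many price values could be converted to numbers
convertible = pd.to_numeric(bad_data['price'], errors='coerce').notna().sum()
print(f"Price values that look numeric: {convertible} of {len(bad_data)}")
```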

### Example 9: Fixing Numeric Type Issues
Here we fix the string-number problem using pd.to_numeric(). After conversion, all our calculations work correctly.

In [None]:
# Fix the problem
bad_data['price'] = pd.to_numeric(bad_data['price'])
bad_data['revenue'] = bad_data['price'] * bad_data['quantity']

print("\nAfter fixing data type:")
print(bad_data)
print(f"\nTotal revenue: ${bad_data['revenue'].sum():,.2f}")
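
Real price columns are often messier than this one. The sketch below uses made-up values containing currency symbols, thousands separators, stray spaces, and a placeholder, and shows one way to clean them: strip the formatting characters, then let `errors='coerce'` turn anything that still isn't a number into `NaN`.

```python
import pandas as pd

# Made-up values of the kind found in exported spreadsheets
messy_prices = pd.Series(['$1,299.00', '699.99', 'N/A', ' 499.99 '])

# Remove dollar signs and commas, then trim surrounding whitespace
cleaned = messy_prices.str.replace(r'[$,]', '', regex=True).str.strip()

# Values that still cannot be parsed become NaN instead of raising an error
prices = pd.to_numeric(cleaned, errors='coerce')

print(prices)
print(f"Unparseable values: {prices.isna().sum()}")
print(f"Sum of valid prices: {prices.sum():,.2f}")
```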

### Example 10: Date Strings vs DateTime Objects
Dates often come as strings, which prevents time-based calculations. This example shows why converting to datetime is essential for any time series analysis.

In [None]:
# Problem 2: Dates stored as strings
date_problems = pd.DataFrame({
    'event': ['Launch', 'Update', 'Review'],
    'date': ['2024-01-15', '2024-02-20', '2024-03-10']  # Just strings!
})

print("Dates as strings:")
print(date_problems)
print(f"Date column type: {date_problems['date'].dtype}")
print()

# Can't do date math
try:
    days_since = pd.Timestamp.now() - date_problems['date']
except TypeError:
    print("ERROR: Can't subtract string from timestamp!")
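
String dates can also fail quietly. Sorting is a common case: text is ordered character by character, which only matches chronological order for formats such as `YYYY-MM-DD`. The sketch below uses made-up dates in `month/day/year` form to show the difference.

```python
import pandas as pd

# Made-up dates in month/day/year format
dates_as_text = pd.Series(['12/25/2023', '1/15/2024', '3/10/2024'])

# Sorting text compares characters, not calendar dates
text_order = dates_as_text.sort_values().tolist()
print("Sorted as text: ", text_order)

# Converting first gives a true chronological sort
real_dates = pd.to_datetime(dates_as_text, format='%m/%d/%Y')
date_order = real_dates.sort_values().dt.strftime('%Y-%m-%d').tolist()
print("Sorted as dates:", date_order)
```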

### Example 11: Converting and Using DateTime
After converting strings to datetime objects, we can perform date arithmetic, extract components like day of week, and do time-based filtering.

In [None]:
# Fix by converting to datetime
date_problems['date'] = pd.to_datetime(date_problems['date'])
date_problems['days_ago'] = (pd.Timestamp.now() - date_problems['date']).dt.days

print("\nAfter converting to datetime:")
print(date_problems)
print("\nNow we can do time calculations!")
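
Extracting date components and filtering by date can be sketched the same way. The example rebuilds the events table with a hypothetical name, `events`, and uses a fixed cutoff date so the output does not depend on when it is run.

```python
import pandas as pd

events = pd.DataFrame({
    'event': ['Launch', 'Update', 'Review'],
    'date': pd.to_datetime(['2024-01-15', '2024-02-20', '2024-03-10'])
})

# Extract components through the .dt accessor
events['weekday'] = events['date'].dt.day_name()
events['month'] = events['date'].dt.month
print(events)

# Time-based filtering: keep events on or after 1 February 2024
after_feb = events[events['date'] >= pd.Timestamp('2024-02-01')]
print("\nEvents on or after 2024-02-01:", after_feb['event'].tolist())

# Date arithmetic: time between the first and last event
gap = events['date'].max() - events['date'].min()
print("Days between first and last event:", gap.days)
```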

---

## Part 5: Data Type Detection and Inference
**How Pandas Guesses Types and When It Gets It Wrong**

### Example 12: Automatic Type Detection
Pandas tries to be smart about detecting data types, but it's conservative. This example shows what pandas can figure out automatically and where it plays it safe.

In [None]:
# Pandas infers a type for each column when it builds a DataFrame or reads data
data = {
    'A': [1, 2, 3, 4],                    # Will be int64
    'B': [1.0, 2.5, 3.7, 4.2],           # Will be float64
    'C': ['x', 'y', 'z', 'w'],           # Will be object (shown as str in pandas 3.0+)
    'D': [True, False, True, False],      # Will be bool
    'E': ['1', '2', '3', '4'],           # Also a string column, not int!
}

df_inferred = pd.DataFrame(data)
print("Pandas type inference:")
print(df_inferred)
print("\nInferred types:")
print(df_inferred.dtypes)
print("\nNotice column E is string, not integer!")
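
Because pandas leaves digit strings alone, the conversion has to be requested. A minimal sketch with `astype`, assuming every value in the column is a clean integer string (the `E_as_int` column name is illustrative):

```python
import pandas as pd

df_inferred = pd.DataFrame({'E': ['1', '2', '3', '4']})

# astype raises an error if any value is not a valid integer string
df_inferred['E_as_int'] = df_inferred['E'].astype('int64')

print(df_inferred.dtypes)
print("Sum after conversion:", df_inferred['E_as_int'].sum())
```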

### Example 13: Mixed Type Columns
When a column has mixed types, pandas defaults to 'object' type and many operations fail. This example shows how to identify and handle mixed-type columns.

In [None]:
# Mixed types cause problems
mixed_data = pd.DataFrame({
    'values': [1, 2, '3', 4, 5.0, 'six', 7, 8, 9, 10]
})

print("Mixed type column:")
print(mixed_data)
print(f"Type: {mixed_data['values'].dtype}")  # Object (generic)
print()

# Try to calculate mean - this will fail
try:
    mean_value = mixed_data['values'].mean()
except TypeError:
    print("ERROR: Can't calculate mean with mixed types!")
    
# Flag values that can be converted to a number (the text '3' can, 'six' cannot)
mixed_data['is_numeric'] = pd.to_numeric(mixed_data['values'], errors='coerce').notna()
print("\nWhich values can be converted to numbers?")
print(mixed_data)
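
One way to handle such a column is to coerce it to numbers, inspect what was rejected, and then calculate on the rest. This is a sketch of that approach, not the only option; whether to drop, fix, or flag the rejected values depends on the data. The names `numeric_values` and `rejected` are illustrative.

```python
import pandas as pd

mixed_data = pd.DataFrame({
    'values': [1, 2, '3', 4, 5.0, 'six', 7, 8, 9, 10]
})

# Convert what can be converted; everything else becomes NaN
numeric_values = pd.to_numeric(mixed_data['values'], errors='coerce')

# List the original values that could not be converted
rejected = mixed_data.loc[numeric_values.isna(), 'values'].tolist()
print("Rejected values:", rejected)

# mean() skips NaN by default, so it averages the nine valid values
mean_value = numeric_values.mean()
print(f"Mean of valid values: {mean_value:.2f}")
```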

### Example 14: Explicit Type Specification
The safest approach is to explicitly specify data types when reading data. This example compares automatic detection with explicit type specification, showing how the latter prevents problems.

In [None]:
# Best practice: Explicitly set types when reading data
from io import StringIO

csv_data = """id,amount,date,category
1001,29.99,2024-01-15,Electronics
1002,45.50,2024-01-16,Clothing
1003,15.00,2024-01-17,Food"""

# Without specifying types
df_auto = pd.read_csv(StringIO(csv_data))
print("Automatic type detection:")
print(df_auto.dtypes)
print()

# With explicit types
df_explicit = pd.read_csv(
    StringIO(csv_data),
    dtype={'id': 'int32', 'category': 'category'},
    parse_dates=['date']
)
print("Explicit type specification:")
print(df_explicit.dtypes)
print()
print("Benefits: Less memory, faster operations, prevents errors!")
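
Explicit types also protect values that only look numeric. The sketch below uses made-up postal codes: automatic detection reads them as integers and drops the leading zero, while `dtype={'zip_code': str}` keeps the text intact. The column names and the `df_zip_auto` and `df_zip_text` variables are hypothetical.

```python
import pandas as pd
from io import StringIO

# Made-up data: postal codes are identifiers, not quantities
zip_csv = """zip_code,city
02134,Boston
10001,New York"""

# Automatic detection treats the codes as integers and loses the leading zero
df_zip_auto = pd.read_csv(StringIO(zip_csv))
print("Automatic:", df_zip_auto['zip_code'].tolist())

# Reading the column as text preserves every character
df_zip_text = pd.read_csv(StringIO(zip_csv), dtype={'zip_code': str})
print("Explicit: ", df_zip_text['zip_code'].tolist())
```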