# Reading Data from Various Data Sources using Python

---

## Table of Contents
1. [Introduction to Data Sources](#introduction)
2. [Reading CSV Files](#csv)
3. [Reading Excel Files](#excel)
4. [Reading JSON Files](#json)
5. [Reading from Databases](#databases)
6. [Reading from Web](#web)
7. [Reading Other Formats](#other-formats)
8. [Writing Data](#writing)
9. [Data Import Best Practices](#best-practices)
10. [Practical Examples](#examples)
11. [Summary](#summary)

---

## 1. Introduction to Data Sources <a id='introduction'></a>

In data analysis and machine learning, **reading data from various sources** is the first critical step. Data arrives in many formats and from many locations, and Python, mainly through pandas, provides readers for most of them.

### Common Data Sources:

| Source Type | Format | Common Use Cases |
|-------------|--------|------------------|
| **Files** | CSV, Excel, JSON, Parquet | Local data storage, data exchange |
| **Databases** | SQL, NoSQL | Structured data storage, enterprise systems |
| **Web** | HTML, APIs, Scraping | Real-time data, web services |
| **Other** | Pickle, HDF5, Text | Specialized storage, large datasets |

### Primary Library: Pandas

**Pandas** is the go-to library for data manipulation in Python. It provides:
- **DataFrame**: 2D labeled data structure (like an Excel spreadsheet)
- **Series**: 1D labeled array
- Built-in functions for reading/writing various formats
- Powerful data manipulation capabilities

### Why Different Formats?

- **CSV**: Simple, universal, human-readable
- **Excel**: Business standard, supports multiple sheets
- **JSON**: Web APIs, nested data structures
- **Parquet**: Big data, efficient compression
- **SQL**: Relational data, complex queries

Let's explore each format in detail!

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np

# Display pandas version
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

# Set display options for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

---

## 2. Reading CSV Files <a id='csv'></a>

**CSV (Comma-Separated Values)** is one of the most common formats for storing tabular data. It's simple, widely supported, and human-readable.

### Basic Syntax:
```python
df = pd.read_csv('filename.csv')
```

### Key Parameters:

| Parameter | Description | Example |
|-----------|-------------|----------|
| `filepath_or_buffer` | Path, URL, or file-like object | `'data.csv'` or `'C:/data/file.csv'` |
| `sep` | Delimiter | `sep=','` (default), `sep='\t'` (tab) |
| `header` | Row number for column names | `header=0` (default), `header=None` |
| `names` | Custom column names | `names=['A', 'B', 'C']` |
| `usecols` | Columns to read | `usecols=[0, 1, 2]` or `usecols=['Name', 'Age']` |
| `skiprows` | Rows to skip | `skiprows=5` or `skiprows=[0, 2]` |
| `nrows` | Number of rows to read | `nrows=100` |
| `index_col` | Column to use as index | `index_col=0` or `index_col='ID'` |
| `dtype` | Data type for columns | `dtype={'Age': int}` |
| `parse_dates` | Parse date columns | `parse_dates=['Date']` |
| `na_values` | Additional NA values | `na_values=['NA', 'missing']` |
| `encoding` | File encoding | `encoding='utf-8'` or `encoding='latin1'` |

In [None]:
# Example 1: Basic CSV reading
# First, let's create a sample CSV file

import io

# Sample CSV data as string
csv_data = """Name,Age,City,Salary
John,28,New York,75000
Alice,34,Los Angeles,85000
Bob,45,Chicago,65000
Emma,29,Houston,70000
Michael,52,Phoenix,90000"""

# Read from string (simulating file)
df = pd.read_csv(io.StringIO(csv_data))

print("Basic CSV Reading:")
print(df)
print(f"\nShape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")

In [None]:
# Example 2: Reading with custom delimiter (tab-separated)

tsv_data = """Name\tAge\tCity\tSalary
John\t28\tNew York\t75000
Alice\t34\tLos Angeles\t85000
Bob\t45\tChicago\t65000"""

df_tsv = pd.read_csv(io.StringIO(tsv_data), sep='\t')
print("Tab-Separated Data:")
print(df_tsv)

In [None]:
# Example 3: Reading specific columns only

# Method 1: Using column indices
df_cols_idx = pd.read_csv(io.StringIO(csv_data), usecols=[0, 1])
print("Reading columns by index [0, 1]:")
print(df_cols_idx)
print()

# Method 2: Using column names
df_cols_names = pd.read_csv(io.StringIO(csv_data), usecols=['Name', 'Salary'])
print("Reading columns by name ['Name', 'Salary']:")
print(df_cols_names)

In [None]:
# Example 4: Handling missing headers and custom column names

# CSV without headers
csv_no_header = """John,28,New York,75000
Alice,34,Los Angeles,85000
Bob,45,Chicago,65000"""

# Read without header and assign custom names
df_custom = pd.read_csv(
    io.StringIO(csv_no_header), 
    header=None,  # No header in file
    names=['Employee_Name', 'Employee_Age', 'Location', 'Annual_Salary']
)

print("CSV with custom column names:")
print(df_custom)

In [None]:
# Example 5: Skipping rows and reading limited rows

csv_with_comments = """# This is a comment
# Data collected on 2024-01-01
Name,Age,City,Salary
John,28,New York,75000
Alice,34,Los Angeles,85000
Bob,45,Chicago,65000
Emma,29,Houston,70000"""

# Skip first 2 rows (comments) and read only 2 data rows
df_skip = pd.read_csv(
    io.StringIO(csv_with_comments), 
    skiprows=2,  # Skip first 2 rows
    nrows=2      # Read only 2 data rows
)

print("Skipping rows and limiting read:")
print(df_skip)
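
Counting comment lines for `skiprows` is fragile when the number of comment lines varies between files. As an alternative sketch, `read_csv` also accepts a `comment` character; the variable names below are illustrative, and the approach assumes `#` never appears inside a real data value, because everything after it on a line is discarded.

```python
import io

import pandas as pd

raw = """# This is a comment
# Data collected on 2024-01-01
Name,Age,City,Salary
John,28,New York,75000
Alice,34,Los Angeles,85000
Bob,45,Chicago,65000
Emma,29,Houston,70000"""

# comment='#' discards everything from '#' to the end of each line;
# lines that are fully commented out are skipped before the header is read.
df_commented = pd.read_csv(io.StringIO(raw), comment="#")

print(df_commented)
```

With this option the header is found automatically, so the same call keeps working if a third comment line is added to the file.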

In [None]:
# Example 6: Setting index column

csv_with_id = """ID,Name,Age,City
101,John,28,New York
102,Alice,34,Los Angeles
103,Bob,45,Chicago
104,Emma,29,Houston"""

# Set 'ID' as index
df_indexed = pd.read_csv(io.StringIO(csv_with_id), index_col='ID')

print("DataFrame with custom index:")
print(df_indexed)
print(f"\nIndex: {df_indexed.index.tolist()}")

In [None]:
# Example 7: Specifying data types

csv_mixed = """Name,Age,Score,IsActive
John,28,85.5,True
Alice,34,92.3,False
Bob,45,78.9,True"""

# Specify dtypes for better performance and accuracy
df_typed = pd.read_csv(
    io.StringIO(csv_mixed),
    dtype={
        'Name': str,
        'Age': int,
        'Score': float,
        'IsActive': bool
    }
)

print("DataFrame with specified dtypes:")
print(df_typed)
print(f"\nData types:\n{df_typed.dtypes}")
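
One caveat with `dtype={'Age': int}`: plain NumPy integers cannot represent missing values, so the read fails if the column has a gap. A minimal sketch of one way around this, assuming a pandas version that supports the nullable `"Int64"` extension dtype (the sample data is made up for illustration):

```python
import io

import pandas as pd

csv_with_gap = """Name,Age,Score
John,28,85.5
Alice,,92.3
Bob,45,78.9"""

# "Int64" (capital I) is pandas' nullable integer dtype: the missing age
# becomes <NA> while the other values stay integers instead of floats.
df_nullable = pd.read_csv(io.StringIO(csv_with_gap), dtype={"Age": "Int64"})

print(df_nullable)
print(df_nullable.dtypes)
```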

In [None]:
# Example 8: Parsing dates

csv_dates = """Date,Product,Sales
2024-01-01,Widget A,150
2024-01-02,Widget B,200
2024-01-03,Widget A,175
2024-01-04,Widget C,300"""

# Parse Date column as datetime
df_dates = pd.read_csv(
    io.StringIO(csv_dates),
    parse_dates=['Date']  # Automatically convert to datetime
)

print("DataFrame with parsed dates:")
print(df_dates)
print(f"\nDate column dtype: {df_dates['Date'].dtype}")

In [None]:
# Example 9: Handling missing values

csv_missing = """Name,Age,City,Salary
John,28,New York,75000
Alice,NA,Los Angeles,85000
Bob,45,missing,65000
Emma,29,Houston,N/A"""

# Specify additional NA values
df_na = pd.read_csv(
    io.StringIO(csv_missing),
    na_values=['NA', 'missing', 'N/A', 'null']  # Treat these as NaN
)

print("DataFrame with handled missing values:")
print(df_na)
print(f"\nMissing values per column:\n{df_na.isnull().sum()}")

In [None]:
# Example 10: Reading large files in chunks

# Create sample large CSV
large_csv_data = "Name,Age,Score\n" + "\n".join(
    [f"Person{i},{20+i%30},{50+i%50}" for i in range(20)]
)

# Read in chunks (useful for very large files)
chunk_size = 5
chunks = []

for i, chunk in enumerate(pd.read_csv(io.StringIO(large_csv_data), chunksize=chunk_size)):
    print(f"Chunk {i+1}:")
    print(chunk)
    print()
    chunks.append(chunk)
    if i == 2:  # Show only first 3 chunks
        print("... (more chunks)")
        break

# Combine chunks if needed
# df_full = pd.concat(chunks, ignore_index=True)

---

## 3. Reading Excel Files <a id='excel'></a>

**Excel files** (.xlsx, .xls) are widely used in business environments. Pandas can read and write Excel files using the `openpyxl` or `xlrd` engines.

### Installation:
```bash
pip install openpyxl  # For .xlsx files
pip install xlrd      # For .xls files (older format)
```

### Basic Syntax:
```python
df = pd.read_excel('filename.xlsx')
```

### Key Parameters:

| Parameter | Description | Example |
|-----------|-------------|----------|
| `sheet_name` | Sheet to read | `sheet_name=0` or `sheet_name='Sheet1'` |
| `header` | Row for column names | `header=0` (default) |
| `usecols` | Columns to read | `usecols='A:C'` or `usecols=[0, 1, 2]` |
| `skiprows` | Rows to skip | `skiprows=2` |
| `nrows` | Number of rows | `nrows=100` |
| `engine` | Excel engine | `engine='openpyxl'` |

In [None]:
# Example 1: Creating and reading Excel file

# Create sample data
data = {
    'Name': ['John', 'Alice', 'Bob', 'Emma', 'Michael'],
    'Age': [28, 34, 45, 29, 52],
    'Department': ['IT', 'HR', 'Finance', 'IT', 'Marketing'],
    'Salary': [75000, 85000, 65000, 70000, 90000]
}

df_sample = pd.DataFrame(data)

# Write to Excel (we'll use this for demonstration)
# df_sample.to_excel('employees.xlsx', index=False)

print("Sample data created:")
print(df_sample)

In [None]:
# Example 2: Reading specific sheet by name or index

# If you have a real Excel file:
# df = pd.read_excel('data.xlsx', sheet_name='Sheet1')  # By name
# df = pd.read_excel('data.xlsx', sheet_name=0)         # By index (first sheet)

# Demonstration with in-memory data
print("Reading Excel file:")
print("- sheet_name='Sheet1' reads the sheet named 'Sheet1'")
print("- sheet_name=0 reads the first sheet")
print("- sheet_name=None reads all sheets (returns dict of DataFrames)")

In [None]:
# Example 3: Reading multiple sheets

# Create multiple sheets for demonstration
data_sheet1 = {'Product': ['A', 'B', 'C'], 'Price': [100, 200, 150]}
data_sheet2 = {'Product': ['D', 'E', 'F'], 'Price': [300, 250, 180]}

# Write to Excel with multiple sheets
# with pd.ExcelWriter('multi_sheet.xlsx') as writer:
#     pd.DataFrame(data_sheet1).to_excel(writer, sheet_name='Q1', index=False)
#     pd.DataFrame(data_sheet2).to_excel(writer, sheet_name='Q2', index=False)

# Read all sheets
# excel_file = pd.read_excel('multi_sheet.xlsx', sheet_name=None)
# print("Available sheets:", excel_file.keys())
# print("Q1 data:\n", excel_file['Q1'])
# print("Q2 data:\n", excel_file['Q2'])

print("To read all sheets: sheet_name=None")
print("Returns a dictionary: {'Sheet1': df1, 'Sheet2': df2, ...}")
print("\nExample:")
print(pd.DataFrame(data_sheet1))
print("\n(This would be one of the sheets)")
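
The multi-sheet workflow above can also be tried without touching the disk. The following is a sketch, not the only way to do it: it writes two sheets into an in-memory `io.BytesIO` buffer and reads them back with `sheet_name=None`. It assumes `openpyxl` is installed and skips the round trip otherwise; the variable names are illustrative.

```python
import importlib.util
import io

import pandas as pd

quarterly = {
    "Q1": pd.DataFrame({"Product": ["A", "B", "C"], "Price": [100, 200, 150]}),
    "Q2": pd.DataFrame({"Product": ["D", "E", "F"], "Price": [300, 250, 180]}),
}

all_sheets = None
if importlib.util.find_spec("openpyxl") is None:
    print("openpyxl is not installed; skipping the Excel round trip")
else:
    buffer = io.BytesIO()
    # Write each DataFrame to its own sheet in a single workbook
    with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
        for sheet_name, frame in quarterly.items():
            frame.to_excel(writer, sheet_name=sheet_name, index=False)

    # Rewind the buffer, then read every sheet into a dict of DataFrames
    buffer.seek(0)
    all_sheets = pd.read_excel(buffer, sheet_name=None, engine="openpyxl")
    print("Available sheets:", list(all_sheets))
    print(all_sheets["Q2"])
```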

In [None]:
# Example 4: Reading specific columns from Excel

# Method 1: Using Excel column letters
# df = pd.read_excel('data.xlsx', usecols='A:C')  # Columns A, B, C

# Method 2: Using column indices
# df = pd.read_excel('data.xlsx', usecols=[0, 1, 2])  # First 3 columns

# Method 3: Using column names
# df = pd.read_excel('data.xlsx', usecols=['Name', 'Age', 'Salary'])

print("Reading specific columns from Excel:")
print("\nMethod 1 - Column letters: usecols='A:D'")
print("Method 2 - Column indices: usecols=[0, 1, 2, 3]")
print("Method 3 - Column names: usecols=['Name', 'Age', 'Salary']")

In [None]:
# Example 5: Writing to Excel with a computed Total column

# Create sample data
sales_data = {
    'Product': ['Widget A', 'Widget B', 'Widget C', 'Widget D'],
    'Q1_Sales': [15000, 23000, 18000, 31000],
    'Q2_Sales': [17000, 25000, 19000, 33000],
    'Q3_Sales': [16000, 24000, 21000, 35000]
}

df_sales = pd.DataFrame(sales_data)
df_sales['Total'] = df_sales[['Q1_Sales', 'Q2_Sales', 'Q3_Sales']].sum(axis=1)

print("Sales data to export:")
print(df_sales)

# Writing to Excel (demonstration)
# df_sales.to_excel('sales_report.xlsx', sheet_name='Sales', index=False)

print("\nUncomment the to_excel() call above to write sales_report.xlsx")

In [None]:
# Example 6: Writing multiple DataFrames to different sheets

# Create multiple DataFrames
df_employees = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob'],
    'Department': ['IT', 'HR', 'Finance']
})

df_departments = pd.DataFrame({
    'Department': ['IT', 'HR', 'Finance'],
    'Budget': [500000, 300000, 400000]
})

# Write to multiple sheets
# with pd.ExcelWriter('company_data.xlsx', engine='openpyxl') as writer:
#     df_employees.to_excel(writer, sheet_name='Employees', index=False)
#     df_departments.to_excel(writer, sheet_name='Departments', index=False)

print("Employees sheet:")
print(df_employees)
print("\nDepartments sheet:")
print(df_departments)
print("\nBoth written to different sheets in same Excel file")

---

## 4. Reading JSON Files <a id='json'></a>

**JSON (JavaScript Object Notation)** is a lightweight data format commonly used for web APIs and configuration files. It supports nested structures.

### Basic Syntax:
```python
df = pd.read_json('data.json')
```

### JSON Orientations:

| Orientation | Description | Structure |
|-------------|-------------|----------|
| `'split'` | Dict with index, columns, data | `{"index": [...], "columns": [...], "data": [...]}` |
| `'records'` | List of records (most common) | `[{"col1": val1, "col2": val2}, ...]` |
| `'index'` | Dict of dicts indexed by row | `{"row1": {"col1": val1}, ...}` |
| `'columns'` | Dict of dicts indexed by column | `{"col1": {"row1": val1}, ...}` |
| `'values'` | Just the values array | `[[val1, val2], [val3, val4], ...]` |

In [None]:
# Example 1: Reading JSON from string (records orientation)

import json

# JSON in 'records' format (most common)
json_records = '''
[
    {"Name": "John", "Age": 28, "City": "New York"},
    {"Name": "Alice", "Age": 34, "City": "Los Angeles"},
    {"Name": "Bob", "Age": 45, "City": "Chicago"}
]
'''

# Wrap the string in StringIO: passing literal JSON text is deprecated in pandas 2.1+
df_json = pd.read_json(io.StringIO(json_records), orient='records')

print("JSON (records orientation):")
print(df_json)

In [None]:
# Example 2: Different JSON orientations

# Create sample DataFrame
df_sample = pd.DataFrame({
    'Name': ['John', 'Alice'],
    'Age': [28, 34],
    'City': ['New York', 'LA']
})

print("Original DataFrame:")
print(df_sample)
print()

# Convert to different JSON formats
print("\n1. RECORDS orientation (list of dicts):")
print(df_sample.to_json(orient='records', indent=2))

print("\n2. COLUMNS orientation (dict of columns):")
print(df_sample.to_json(orient='columns', indent=2))

print("\n3. INDEX orientation (dict of rows):")
print(df_sample.to_json(orient='index', indent=2))
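
The `orient` argument matters in both directions: the value used when writing has to be passed again when reading, otherwise pandas may misinterpret the layout. A small round-trip sketch with the `'split'` orientation (variable names are illustrative):

```python
import io

import pandas as pd

df_original = pd.DataFrame({"Name": ["John", "Alice"], "Age": [28, 34]})

# Serialize with 'split': index, columns, and data are stored separately
json_split = df_original.to_json(orient="split")
print(json_split)

# Read back with the same orientation to recover the original layout
df_restored = pd.read_json(io.StringIO(json_split), orient="split")
print(df_restored)
```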

In [None]:
# Example 3: Handling nested JSON

# Nested JSON structure
nested_json = '''
[
    {
        "name": "John",
        "age": 28,
        "address": {
            "city": "New York",
            "zipcode": "10001"
        },
        "skills": ["Python", "SQL"]
    },
    {
        "name": "Alice",
        "age": 34,
        "address": {
            "city": "Los Angeles",
            "zipcode": "90001"
        },
        "skills": ["Java", "JavaScript"]
    }
]
'''

# Read JSON
df_nested = pd.read_json(io.StringIO(nested_json))  # StringIO avoids the literal-JSON deprecation

print("Nested JSON DataFrame:")
print(df_nested)
print(f"\nData types:\n{df_nested.dtypes}")

In [None]:
# Example 4: Normalizing nested JSON (flattening)

from pandas import json_normalize

# Parse JSON string
data = json.loads(nested_json)

# Normalize (flatten) nested structure
df_normalized = json_normalize(data)

print("Normalized (flattened) JSON:")
print(df_normalized)
print(f"\nColumns: {df_normalized.columns.tolist()}")
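
`json_normalize` flattens nested dictionaries into dotted column names, but list-valued fields such as `skills` stay as Python lists in a single cell. One possible follow-up, sketched here on a trimmed copy of the data, is `DataFrame.explode`, which produces one row per list element:

```python
import pandas as pd

people = [
    {"name": "John", "address": {"city": "New York"}, "skills": ["Python", "SQL"]},
    {"name": "Alice", "address": {"city": "Los Angeles"}, "skills": ["Java", "JavaScript"]},
]

# Step 1: flatten nested dicts (address -> address.city)
df_flat = pd.json_normalize(people)

# Step 2: turn each list element in 'skills' into its own row
df_skills = df_flat.explode("skills", ignore_index=True)

print(df_skills)
```

Whether one row per skill is the right shape depends on the analysis; for counting skills across people it is convenient, while for per-person summaries the un-exploded frame is usually easier to work with.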

In [None]:
# Example 5: Reading JSON from API-like structure

# JSON with metadata (common in APIs)
api_json = '''
{
    "status": "success",
    "count": 3,
    "data": [
        {"id": 1, "product": "Widget A", "price": 100},
        {"id": 2, "product": "Widget B", "price": 200},
        {"id": 3, "product": "Widget C", "price": 150}
    ]
}
'''

# Parse JSON and extract data
json_data = json.loads(api_json)
df_api = pd.DataFrame(json_data['data'])

print(f"API Status: {json_data['status']}")
print(f"Record Count: {json_data['count']}")
print("\nData:")
print(df_api)

In [None]:
# Example 6: Writing DataFrame to JSON

data = {
    'Product': ['A', 'B', 'C'],
    'Price': [100, 200, 150],
    'Stock': [50, 30, 45]
}

df_products = pd.DataFrame(data)

# Write to JSON file (demonstration)
# df_products.to_json('products.json', orient='records', indent=2)

print("DataFrame:")
print(df_products)
print("\nJSON output (records):")
print(df_products.to_json(orient='records', indent=2))

---

## 5. Reading from Databases <a id='databases'></a>

**Databases** are the primary storage for structured data in production systems. Pandas can read data directly from SQL databases.

### Database Types:
- **SQLite**: Lightweight, serverless (built-in with Python)
- **MySQL/PostgreSQL**: Full-featured relational databases
- **SQL Server, Oracle**: Enterprise databases

### Required Libraries:
```bash
pip install sqlalchemy  # Database toolkit (needed for MySQL, PostgreSQL, etc.)
# sqlite3 is part of Python's standard library; no installation is needed
```

### Basic Syntax:
```python
df = pd.read_sql('SELECT * FROM table_name', connection)  # SQL query or table name
df = pd.read_sql_query('SELECT ...', connection)          # SQL query only
df = pd.read_sql_table('table_name', engine)              # Whole table; requires a SQLAlchemy connectable
```

In [None]:
# Example 1: Creating SQLite database and table

import sqlite3

# Create in-memory SQLite database
conn = sqlite3.connect(':memory:')  # In-memory database for demo

# Create sample table
cursor = conn.cursor()
cursor.execute('''
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        age INTEGER,
        department TEXT,
        salary REAL
    )
''')

# Insert sample data
employees_data = [
    (1, 'John', 28, 'IT', 75000),
    (2, 'Alice', 34, 'HR', 85000),
    (3, 'Bob', 45, 'Finance', 65000),
    (4, 'Emma', 29, 'IT', 70000),
    (5, 'Michael', 52, 'Marketing', 90000)
]

cursor.executemany('INSERT INTO employees VALUES (?, ?, ?, ?, ?)', employees_data)
conn.commit()

print("SQLite database created with 'employees' table")
print(f"Inserted {len(employees_data)} records")

In [None]:
# Example 2: Reading entire table

# Read entire table
df_employees = pd.read_sql('SELECT * FROM employees', conn)

print("All employees:")
print(df_employees)

In [None]:
# Example 3: Reading with SQL queries

# Query 1: Filter by department
query1 = "SELECT * FROM employees WHERE department = 'IT'"
df_it = pd.read_sql_query(query1, conn)

print("IT Department employees:")
print(df_it)
print()

# Query 2: Aggregate query
query2 = '''
    SELECT department, 
           COUNT(*) as employee_count, 
           AVG(salary) as avg_salary
    FROM employees
    GROUP BY department
'''
df_dept_stats = pd.read_sql_query(query2, conn)

print("Department statistics:")
print(df_dept_stats)

In [None]:
# Example 4: Reading with parameterized queries (safe from SQL injection)

# Parameterized query (safer)
min_salary = 70000
query = "SELECT * FROM employees WHERE salary >= ?"

df_high_salary = pd.read_sql_query(query, conn, params=(min_salary,))

print(f"Employees with salary >= {min_salary}:")
print(df_high_salary)
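
With more than one parameter, positional `?` placeholders become easy to mix up. The `sqlite3` driver also accepts named placeholders with a dictionary, as sketched below. The table contents and variable names are illustrative, and the placeholder syntax differs between database drivers, so this form is specific to SQLite.

```python
import sqlite3

import pandas as pd

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn_demo.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("John", "IT", 75000),
        ("Alice", "HR", 85000),
        ("Bob", "Finance", 65000),
        ("Emma", "IT", 70000),
    ],
)
conn_demo.commit()

# Named placeholders (:dept, :min_salary) are filled from the params dict
query = """
    SELECT name, salary
    FROM employees
    WHERE department = :dept AND salary >= :min_salary
    ORDER BY name
"""
df_filtered = pd.read_sql_query(
    query, conn_demo, params={"dept": "IT", "min_salary": 70000}
)
conn_demo.close()

print(df_filtered)
```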

In [None]:
# Example 5: Writing DataFrame to SQL database

# Create new DataFrame
new_employees = pd.DataFrame({
    'id': [6, 7, 8],
    'name': ['Sarah', 'Tom', 'Lisa'],
    'age': [31, 38, 27],
    'department': ['IT', 'HR', 'Finance'],
    'salary': [78000, 82000, 68000]
})

print("New employees to add:")
print(new_employees)

# Write to database
new_employees.to_sql(
    'employees',           # Table name
    conn,                  # Connection
    if_exists='append',    # Append to existing table
    index=False            # Don't write index
)

# Verify
df_all = pd.read_sql('SELECT * FROM employees', conn)
print(f"\nTotal employees after insert: {len(df_all)}")
print(df_all)

In [None]:
# Example 6: Using SQLAlchemy (recommended for production)

from sqlalchemy import create_engine

# Create engine (for SQLite)
engine = create_engine('sqlite:///:memory:')

# Write DataFrame to database
df_employees.to_sql('employees_new', engine, index=False, if_exists='replace')

# Read back
df_read = pd.read_sql('SELECT * FROM employees_new', engine)

print("Data read using SQLAlchemy:")
print(df_read)

# For other databases:
# MySQL: engine = create_engine('mysql+pymysql://user:password@localhost/dbname')
# PostgreSQL: engine = create_engine('postgresql://user:password@localhost/dbname')

In [None]:
# Clean up: Close connection
conn.close()
print("Database connection closed")

---

## 6. Reading from Web <a id='web'></a>

**Web data** can come from HTML tables, APIs, or web scraping. Pandas provides built-in support for reading HTML tables.

### Methods:
1. **pd.read_html()**: Extract tables from HTML
2. **requests + JSON**: Fetch data from APIs
3. **BeautifulSoup**: Web scraping (advanced)

### Installation:
```bash
pip install lxml html5lib beautifulsoup4 requests
```

In [None]:
# Example 1: Reading HTML tables with pd.read_html()

# Sample HTML with table
html_data = '''
<html>
<body>
    <h1>Employee Data</h1>
    <table>
        <tr>
            <th>Name</th>
            <th>Age</th>
            <th>Department</th>
        </tr>
        <tr>
            <td>John</td>
            <td>28</td>
            <td>IT</td>
        </tr>
        <tr>
            <td>Alice</td>
            <td>34</td>
            <td>HR</td>
        </tr>
        <tr>
            <td>Bob</td>
            <td>45</td>
            <td>Finance</td>
        </tr>
    </table>
</body>
</html>
'''

# Read HTML tables (returns list of DataFrames)
tables = pd.read_html(io.StringIO(html_data))

print(f"Number of tables found: {len(tables)}")
print("\nFirst table:")
print(tables[0])

In [None]:
# Example 2: Reading from URL (demonstration)

# Read tables from Wikipedia (example)
# url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
# tables = pd.read_html(url)
# df_gdp = tables[0]  # First table

print("Reading from URL:")
print("tables = pd.read_html('https://example.com/page.html')")
print("df = tables[0]  # Get first table")
print("\nNote: Requires internet connection and valid URL")

In [None]:
# Example 3: Fetching JSON data from API

# Simulating API response
api_response = {
    'status': 'success',
    'data': [
        {'id': 1, 'name': 'John', 'email': 'john@example.com'},
        {'id': 2, 'name': 'Alice', 'email': 'alice@example.com'},
        {'id': 3, 'name': 'Bob', 'email': 'bob@example.com'}
    ]
}

# Convert API response to DataFrame
df_api = pd.DataFrame(api_response['data'])

print("API data converted to DataFrame:")
print(df_api)

# Real API example (requires requests library):
# import requests
# response = requests.get('https://api.example.com/data', timeout=10)
# response.raise_for_status()  # Raise an exception on HTTP 4xx/5xx responses
# data = response.json()
# df = pd.DataFrame(data)

In [None]:
# Example 4: Reading CSV from URL

# Pandas can read CSV directly from URL
# url = 'https://raw.githubusercontent.com/example/data.csv'
# df = pd.read_csv(url)

print("Reading CSV from URL:")
print("df = pd.read_csv('https://example.com/data.csv')")
print("\nWorks with any publicly accessible CSV file")

In [None]:
# Example 5: Web scraping with BeautifulSoup (basic)

from bs4 import BeautifulSoup

# Sample HTML
html = '''
<div class="products">
    <div class="product">
        <span class="name">Widget A</span>
        <span class="price">$100</span>
    </div>
    <div class="product">
        <span class="name">Widget B</span>
        <span class="price">$200</span>
    </div>
</div>
'''

# Parse HTML
soup = BeautifulSoup(html, 'html.parser')

# Extract data
products = []
for product in soup.find_all('div', class_='product'):
    name = product.find('span', class_='name').text
    price = product.find('span', class_='price').text
    products.append({'Product': name, 'Price': price})

# Create DataFrame
df_scraped = pd.DataFrame(products)

print("Scraped data:")
print(df_scraped)
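
Scraped values arrive as strings, so the `Price` column above holds text such as `$100` rather than numbers. A minimal cleaning sketch, assuming prices use a leading `$` and optional thousands separators (the second price is made up to show the separator case):

```python
import pandas as pd

df_scraped_demo = pd.DataFrame(
    {"Product": ["Widget A", "Widget B"], "Price": ["$100", "$1,250.50"]}
)

# Strip the currency symbol and thousands separator, then convert to float
df_scraped_demo["Price"] = (
    df_scraped_demo["Price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

print(df_scraped_demo)
print(df_scraped_demo.dtypes)
```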

---

## 7. Reading Other Formats <a id='other-formats'></a>

Pandas supports many specialized data formats for different use cases.

### Format Comparison:

| Format | Use Case | Pros | Cons |
|--------|----------|------|------|
| **Parquet** | Big data, analytics | Fast, compressed, typed | Binary format |
| **HDF5** | Scientific data | Hierarchical, fast | Complex |
| **Pickle** | Python objects | Preserves types | Python-only, security risk |
| **Feather** | Data exchange | Very fast | Limited compression |
| **Text** | Simple data | Human-readable | Limited structure |

In [None]:
# Example 1: Parquet files (efficient columnar storage)

# Create sample data
df_sample = pd.DataFrame({
    'id': range(1, 6),
    'name': ['John', 'Alice', 'Bob', 'Emma', 'Michael'],
    'salary': [75000, 85000, 65000, 70000, 90000],
    'hire_date': pd.date_range('2020-01-01', periods=5, freq='MS')  # 'MS' = month start; the 'M' alias is deprecated in pandas 2.2+
})

print("Sample DataFrame:")
print(df_sample)

# Writing to Parquet (requires pyarrow or fastparquet)
# df_sample.to_parquet('data.parquet', engine='pyarrow')

# Reading from Parquet
# df_read = pd.read_parquet('data.parquet')

print("\nParquet format:")
print("- Highly compressed")
print("- Preserves data types")
print("- Fast for big data")
print("\nRequires: pip install pyarrow")
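
To see the type preservation in practice, here is a hedged round-trip sketch that writes to a temporary directory and reads the file back. It assumes `pyarrow` is installed and skips the round trip otherwise; the DataFrame and variable names are illustrative.

```python
import importlib.util
import os
import tempfile

import pandas as pd

df_parquet = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["John", "Alice", "Bob"],
    "hire_date": pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"]),
})

df_roundtrip = None
if importlib.util.find_spec("pyarrow") is None:
    print("pyarrow is not installed; skipping the Parquet round trip")
else:
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "data.parquet")
        df_parquet.to_parquet(path, engine="pyarrow", index=False)
        df_roundtrip = pd.read_parquet(path, engine="pyarrow")
    # The datetime column comes back as a datetime dtype, not as strings
    print(df_roundtrip.dtypes)
```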

In [None]:
# Example 2: Pickle files (Python serialization)

import pickle

# Create DataFrame with complex types
df_complex = pd.DataFrame({
    'name': ['John', 'Alice', 'Bob'],
    'scores': [[85, 90, 88], [92, 87, 95], [78, 82, 80]],  # Lists
    'metadata': [{'age': 28}, {'age': 34}, {'age': 45}]     # Dicts
})

print("DataFrame with complex types:")
print(df_complex)

# Save to pickle
# df_complex.to_pickle('data.pkl')

# Load from pickle
# df_loaded = pd.read_pickle('data.pkl')

print("\nPickle format:")
print("- Preserves all Python objects")
print("- Fast read/write")
print("- WARNING: Only load from trusted sources!")
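
The claim that pickle preserves Python objects can be checked with a short round trip. This sketch writes to a temporary directory so nothing is left behind; the data and variable names are illustrative.

```python
import os
import tempfile

import pandas as pd

df_nested_types = pd.DataFrame({
    "name": ["John", "Alice"],
    "scores": [[85, 90, 88], [92, 87, 95]],   # list values
    "metadata": [{"age": 28}, {"age": 34}],   # dict values
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.pkl")
    df_nested_types.to_pickle(path)
    # Only unpickle files you created yourself or otherwise trust
    df_unpickled = pd.read_pickle(path)

# The cells are still real lists and dicts, not their string representations
print(type(df_unpickled.loc[0, "scores"]))
print(type(df_unpickled.loc[0, "metadata"]))
```

A CSV round trip of the same frame would return these cells as strings, which is the main reason to reach for pickle despite its security caveat.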

In [None]:
# Example 3: HDF5 files (hierarchical data)

# Create sample data
df_sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=5),
    'product': ['A', 'B', 'A', 'C', 'B'],
    'sales': [100, 200, 150, 300, 250]
})

print("Sales data:")
print(df_sales)

# Writing to HDF5 (requires tables/pytables)
# df_sales.to_hdf('data.h5', key='sales', mode='w')

# Reading from HDF5
# df_read = pd.read_hdf('data.h5', key='sales')

print("\nHDF5 format:")
print("- Hierarchical structure (multiple datasets)")
print("- Fast I/O for large datasets")
print("- Supports compression")
print("\nRequires: pip install tables")

In [None]:
# Example 4: Reading plain text files

# Text file with structured data
text_data = """Name: John, Age: 28, City: New York
Name: Alice, Age: 34, City: Los Angeles
Name: Bob, Age: 45, City: Chicago"""

# Parse text manually
lines = text_data.strip().split('\n')
records = []

for line in lines:
    parts = line.split(', ')
    record = {}
    for part in parts:
        key, value = part.split(': ')
        record[key] = value
    records.append(record)

df_text = pd.DataFrame(records)

print("Parsed text data:")
print(df_text)
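
The manual loop above leaves every value as a string and breaks if a value itself contains `', '`. As an alternative sketch, a regular expression with named groups can be applied to all lines at once through `Series.str.extract`; the pattern below is an assumption about the line layout shown above and would need adjusting for other files.

```python
import pandas as pd

text_lines = [
    "Name: John, Age: 28, City: New York",
    "Name: Alice, Age: 34, City: Los Angeles",
    "Name: Bob, Age: 45, City: Chicago",
]

# Each named group (?P<...>) becomes a column in the result
pattern = r"Name: (?P<Name>[^,]+), Age: (?P<Age>\d+), City: (?P<City>.+)"
df_extracted = pd.Series(text_lines).str.extract(pattern)

# Extracted values are strings; convert Age explicitly
df_extracted["Age"] = df_extracted["Age"].astype(int)

print(df_extracted)
print(df_extracted.dtypes)
```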

In [None]:
# Example 5: Reading clipboard data

# Copy data to clipboard first, then:
# df = pd.read_clipboard()

print("Reading from clipboard:")
print("1. Copy tabular data (from Excel, web table, etc.)")
print("2. Run: df = pd.read_clipboard()")
print("3. Data is automatically parsed into DataFrame")
print("\nVery useful for quick data imports!")

In [None]:
# Example 6: Feather format (fast data exchange)

# Create sample data
df_sample = pd.DataFrame({
    'A': range(100),
    'B': np.random.randn(100),
    'C': ['text'] * 100
})

# Write to Feather
# df_sample.to_feather('data.feather')

# Read from Feather
# df_read = pd.read_feather('data.feather')

print("Feather format:")
print("- Very fast read/write")
print("- Language-agnostic (works with R, Python, etc.)")
print("- Good for temporary storage")
print(f"\nSample data shape: {df_sample.shape}")

---

## 8. Writing Data <a id='writing'></a>

After processing data, you often need to export it to various formats. Pandas provides comprehensive writing functions.

### Common Export Methods:

| Method | Description | Usage |
|--------|-------------|-------|
| `to_csv()` | Export to CSV | `df.to_csv('file.csv')` |
| `to_excel()` | Export to Excel | `df.to_excel('file.xlsx')` |
| `to_json()` | Export to JSON | `df.to_json('file.json')` |
| `to_sql()` | Export to database | `df.to_sql('table', conn)` |
| `to_parquet()` | Export to Parquet | `df.to_parquet('file.parquet')` |
| `to_pickle()` | Export to Pickle | `df.to_pickle('file.pkl')` |
| `to_html()` | Export to HTML | `df.to_html('file.html')` |

In [None]:
# Create sample DataFrame for export examples

df_export = pd.DataFrame({
    'Product': ['Widget A', 'Widget B', 'Widget C', 'Widget D'],
    'Category': ['Electronics', 'Home', 'Electronics', 'Sports'],
    'Price': [299.99, 49.99, 199.99, 89.99],
    'Stock': [50, 120, 35, 80],
    'Rating': [4.5, 4.2, 4.8, 4.1]
})

print("Sample data for export:")
print(df_export)

In [None]:
# Example 1: Exporting to CSV with options

# Basic export
# df_export.to_csv('products.csv', index=False)

# Export with custom settings
# df_export.to_csv(
#     'products_custom.csv',
#     index=False,           # Don't write row numbers
#     sep=';',               # Use semicolon separator
#     encoding='utf-8',      # UTF-8 encoding
#     float_format='%.2f',   # 2 decimal places for floats
#     columns=['Product', 'Price', 'Stock']  # Select columns
# )

print("CSV export options:")
print("- index=False: Don't write index")
print("- sep=';': Custom delimiter")
print("- encoding='utf-8': Character encoding")
print("- float_format='%.2f': Number formatting")
print("- columns=[...]: Select specific columns")
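
These options can be tried without writing a file: when no path is given, `to_csv` returns the CSV text as a string. The sketch below exports a small illustrative frame with a semicolon separator and reads it back to confirm nothing was lost.

```python
import io

import pandas as pd

df_small = pd.DataFrame({
    "Product": ["Widget A", "Widget B"],
    "Price": [299.99, 49.99],
    "Stock": [50, 120],
})

# No path argument: to_csv returns the CSV content as a string
csv_text = df_small.to_csv(index=False, sep=";", float_format="%.2f")
print(csv_text)

# The reader must be told about the non-default separator
df_back = pd.read_csv(io.StringIO(csv_text), sep=";")
print(df_back)
```

The separator is not recorded anywhere in the file, so whoever reads the export has to know that `sep=';'` was used.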

In [None]:
# Example 2: Exporting to Excel with formatting

# Single sheet
# df_export.to_excel('products.xlsx', sheet_name='Products', index=False)

# Multiple sheets in one workbook
# with pd.ExcelWriter('sales_report.xlsx', engine='openpyxl') as writer:
#     df_export.to_excel(writer, sheet_name='Products', index=False)
#     df_export[df_export['Category'] == 'Electronics'].to_excel(
#         writer, sheet_name='Electronics', index=False
#     )

print("Excel export features:")
print("- Multiple sheets")
print("- Custom sheet names")
print("- Formatting options")
print("- Formulas (with xlsxwriter)")

In [None]:
# Example 3: Exporting to JSON with different orientations

# Records format (most common for APIs)
json_records = df_export.to_json(orient='records', indent=2)
print("JSON (records format):")
print(json_records)
print()

# Columns format
json_columns = df_export.to_json(orient='columns', indent=2)
print("JSON (columns format):")
print(json_columns[:200], "...")  # Show first 200 chars

In [None]:
# Example 4: Exporting to SQL database

# Create database connection
conn = sqlite3.connect(':memory:')

# Write to SQL
df_export.to_sql(
    'products',           # Table name
    conn,                 # Connection
    if_exists='replace',  # Replace if exists
    index=False           # Don't write index
)

# Verify
df_verify = pd.read_sql('SELECT * FROM products', conn)
print("Data written to SQL and read back:")
print(df_verify)

conn.close()

In [None]:
# Example 5: Exporting to HTML

# Convert DataFrame to HTML table
html_table = df_export.to_html(
    index=False,
    classes='table table-striped',  # CSS classes
    border=0
)

print("HTML table:")
print(html_table)

In [None]:
# Example 6: Exporting to clipboard

# Copy to clipboard
# df_export.to_clipboard(index=False)

print("Clipboard export:")
print("df.to_clipboard(index=False)")
print("\nData is copied and can be pasted into Excel, Google Sheets, etc.")
print("Very convenient for quick data sharing!")

---

## 9. Data Import Best Practices <a id='best-practices'></a>

Following best practices ensures reliable and efficient data imports.

### Key Principles:

1. **Always validate data after import**
2. **Handle encoding issues early**
3. **Use appropriate data types**
4. **Handle large files efficiently**
5. **Check for missing values**
6. **Verify data integrity**

In [None]:
# Best Practice 1: Data validation after import

# Create sample data with issues
csv_with_issues = """Name,Age,Salary,Department
John,28,75000,IT
Alice,34,85000,HR
Bob,invalid,65000,Finance
Emma,29,,IT
Michael,52,90000,Marketing"""

df_validate = pd.read_csv(io.StringIO(csv_with_issues))

print("Imported data:")
print(df_validate)
print()

# Validation steps
print("Data Validation:")
print(f"1. Shape: {df_validate.shape}")
print(f"2. Data types:\n{df_validate.dtypes}")
print(f"\n3. Missing values:\n{df_validate.isnull().sum()}")
print(f"\n4. Duplicates: {df_validate.duplicated().sum()}")
print(f"\n5. Basic statistics:\n{df_validate.describe()}")
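
The `dtypes` output above shows `Age` imported as text because of the `invalid` entry. One possible repair, sketched here on a small stand-in frame (the names `raw` and `df_demo` are illustrative and not part of the notebook), is `pd.to_numeric` with `errors='coerce'`, which turns unparseable entries into `NaN` so they appear in the missing-value count instead of hiding inside a text column.

```python
import io
import pandas as pd

# Stand-in data with one unparseable Age and one missing Salary
raw = """Name,Age,Salary
John,28,75000
Bob,invalid,65000
Emma,29,"""

df_demo = pd.read_csv(io.StringIO(raw))

# Age is read as text because one entry is not a number
age_dtype_before = df_demo['Age'].dtype

# Coerce: values that cannot be parsed become NaN instead of raising
df_demo['Age'] = pd.to_numeric(df_demo['Age'], errors='coerce')

bad_age_rows = int(df_demo['Age'].isna().sum())
print(f"Age dtype before: {age_dtype_before}, after: {df_demo['Age'].dtype}")
print(f"Rows with unparseable Age: {bad_age_rows}")
```

Coercion is lossy: the original text (`invalid`) is gone afterwards, so it can be worth inspecting or logging the offending rows before converting.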

In [None]:
# Best Practice 2: Handling encoding issues

# Common encodings to try, strictest first.
# latin1 can decode any byte sequence, so it must come last
# ('iso-8859-1' is an alias for latin1 and is therefore omitted).
encodings = ['utf-8', 'cp1252', 'latin1']

def read_with_encoding(filepath):
    """Try different encodings until one works"""
    for encoding in encodings:
        try:
            df = pd.read_csv(filepath, encoding=encoding)
            print(f"Successfully read with encoding: {encoding}")
            return df
        except UnicodeDecodeError:
            continue
    raise ValueError("Could not read file with any encoding")

print("Encoding handling:")
print("1. Try UTF-8 first (most common)")
print("2. Fall back to cp1252, then latin1, for Windows files")
print("3. Use encoding_errors='ignore' or encoding_errors='replace' as last resort")
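
As a rough illustration of the last-resort option, the sketch below reads the same bytes twice: once with the encoding they were written in, and once as UTF-8 with undecodable bytes replaced. It assumes pandas 1.3 or later, where `read_csv` accepts `encoding_errors`; the sample bytes and the names `df_ok` and `df_lossy` are made up for the demonstration.

```python
import io
import pandas as pd

# Bytes as a Windows tool might write them: in cp1252, 'é' is the single byte 0xE9,
# which is not valid UTF-8 on its own
raw_bytes = "name\ncafé\n".encode("cp1252")

# Preferred: decode with the encoding the file was actually written in
df_ok = pd.read_csv(io.BytesIO(raw_bytes), encoding="cp1252")

# Last resort: keep UTF-8 but substitute a replacement character for bad bytes
df_lossy = pd.read_csv(
    io.BytesIO(raw_bytes),
    encoding="utf-8",
    encoding_errors="replace",  # assumption: pandas >= 1.3
)

print(df_ok["name"][0])
print(df_lossy["name"][0])
```

The second read succeeds but silently loses the accented character, which is why it is better treated as a fallback than as a default.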

In [None]:
# Best Practice 3: Efficient handling of large files

def process_large_file(filepath, chunksize=10000):
    """Process large file in chunks to avoid memory issues"""
    # Example: Calculate statistics without loading entire file
    total_rows = 0
    sum_values = 0
    
    for chunk in pd.read_csv(filepath, chunksize=chunksize):
        total_rows += len(chunk)
        # Process chunk here
        # sum_values += chunk['column_name'].sum()
    
    return total_rows

print("Large file handling:")
print("1. Use chunksize parameter")
print("2. Use usecols to read only needed columns")
print("3. Use dtype to reduce memory (int8 vs int64)")
print("4. Consider Parquet format for big data")
print("5. Use Dask for very large datasets")
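
The first three tips can be combined in a single `read_csv` call. The following is a minimal sketch on an in-memory stand-in for a large file; the column names (`id`, `region`, `amount`, `notes`) and the chunk size are assumptions chosen for the example.

```python
import io
import pandas as pd

# Small in-memory stand-in for a large file (hypothetical columns)
big_csv = "id,region,amount,notes\n" + "\n".join(
    f"{i},{'north' if i % 2 else 'south'},{i * 10},free text {i}"
    for i in range(1, 101)
)

total_rows = 0
amount_by_region = {}

reader = pd.read_csv(
    io.StringIO(big_csv),
    usecols=["region", "amount"],                      # skip columns that are never used
    dtype={"region": "category", "amount": "int32"},   # smaller than the default dtypes
    chunksize=25,                                      # yields DataFrames of up to 25 rows
)

for chunk in reader:
    total_rows += len(chunk)
    # Aggregate within the chunk, then fold into the running totals
    subtotals = chunk.groupby("region", observed=True)["amount"].sum()
    for region, subtotal in subtotals.items():
        amount_by_region[region] = amount_by_region.get(region, 0) + int(subtotal)

print(f"Rows processed: {total_rows}")
print(f"Amount by region: {amount_by_region}")
```

Only aggregations that can be combined across chunks (sums, counts, min/max) fit this pattern directly; a median, for example, would need a different approach.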

In [None]:
# Best Practice 4: Type conversion and validation

csv_data = """Name,Age,Salary,StartDate
John,28,75000,2020-01-15
Alice,34,85000,2018-06-20
Bob,45,65000,2015-03-10"""

# Read with proper types
df_typed = pd.read_csv(
    io.StringIO(csv_data),
    dtype={'Name': str, 'Age': 'Int64', 'Salary': 'Int64'},
    parse_dates=['StartDate']
)

print("Properly typed DataFrame:")
print(df_typed)
print(f"\nData types:\n{df_typed.dtypes}")
print(f"\nMemory usage: {df_typed.memory_usage(deep=True).sum()} bytes")

In [None]:
# Best Practice 5: Comprehensive data quality check

def data_quality_report(df):
    """Generate comprehensive data quality report"""
    print("=" * 60)
    print("DATA QUALITY REPORT")
    print("=" * 60)
    
    print(f"\n1. BASIC INFO:")
    print(f"   Shape: {df.shape}")
    print(f"   Memory: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")
    
    print(f"\n2. DATA TYPES:")
    print(df.dtypes)
    
    print(f"\n3. MISSING VALUES:")
    missing = df.isnull().sum()
    missing_pct = (missing / len(df) * 100).round(2)
    print(pd.DataFrame({'Count': missing, 'Percentage': missing_pct}))
    
    print(f"\n4. DUPLICATES:")
    print(f"   Duplicate rows: {df.duplicated().sum()}")
    
    print(f"\n5. NUMERIC SUMMARY:")
    print(df.describe())
    
    print("\n" + "=" * 60)

# Example usage
df_sample = pd.DataFrame({
    'A': [1, 2, 3, None, 5],
    'B': [10, 20, 30, 40, 50],
    'C': ['x', 'y', 'z', 'x', 'y']
})

data_quality_report(df_sample)

In [None]:
# Best Practice 6: Safe file reading with error handling

def safe_read_csv(filepath, **kwargs):
    """Safely read CSV with error handling"""
    try:
        df = pd.read_csv(filepath, **kwargs)
        print(f"Successfully read {len(df)} rows from {filepath}")
        return df
    except FileNotFoundError:
        print(f"Error: File {filepath} not found")
        return None
    except pd.errors.EmptyDataError:
        print(f"Error: File {filepath} is empty")
        return None
    except Exception as e:
        print(f"Error reading file: {str(e)}")
        return None

print("Safe file reading:")
print("- Always use try-except blocks")
print("- Validate file exists before reading")
print("- Handle specific exceptions")
print("- Log errors for debugging")
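
The last two tips (checking that the file exists and logging errors) are not shown in the function above. One way they might look, using `pathlib` and the standard `logging` module, is sketched below; the function name `load_csv_checked` and the logger name `data_import` are hypothetical.

```python
import logging
import tempfile
from pathlib import Path

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("data_import")  # logger name is arbitrary


def load_csv_checked(filepath, **kwargs):
    """Return a DataFrame, or None if the file is missing, empty, or malformed."""
    path = Path(filepath)
    if not path.is_file():
        logger.error("File not found: %s", path)
        return None
    try:
        df = pd.read_csv(path, **kwargs)
    except pd.errors.EmptyDataError:
        logger.error("File is empty: %s", path)
        return None
    except pd.errors.ParserError:
        # logger.exception also records the traceback
        logger.exception("Could not parse %s", path)
        return None
    logger.info("Read %d rows from %s", len(df), path)
    return df


# Usage with temporary files so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.csv"
    good.write_text("a,b\n1,2\n3,4\n")
    empty = Path(tmp) / "empty.csv"
    empty.write_text("")

    df_good = load_csv_checked(good)
    df_empty = load_csv_checked(empty)
    df_missing = load_csv_checked(Path(tmp) / "missing.csv")
```

Returning `None` keeps the sketch close to `safe_read_csv`; in a pipeline it may be preferable to log and re-raise so that failures cannot be ignored by accident.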

---

## 10. Practical Examples <a id='examples'></a>

Real-world scenarios combining multiple data sources and techniques.

In [None]:
# Example 1: Combining data from multiple CSV files

# Simulate multiple CSV files (monthly sales data)
jan_data = """Date,Product,Sales
2024-01-01,Widget A,100
2024-01-02,Widget B,150
2024-01-03,Widget A,120"""

feb_data = """Date,Product,Sales
2024-02-01,Widget A,130
2024-02-02,Widget B,170
2024-02-03,Widget A,140"""

mar_data = """Date,Product,Sales
2024-03-01,Widget A,160
2024-03-02,Widget B,190
2024-03-03,Widget A,150"""

# Read and combine
df_jan = pd.read_csv(io.StringIO(jan_data), parse_dates=['Date'])
df_feb = pd.read_csv(io.StringIO(feb_data), parse_dates=['Date'])
df_mar = pd.read_csv(io.StringIO(mar_data), parse_dates=['Date'])

# Concatenate
df_q1 = pd.concat([df_jan, df_feb, df_mar], ignore_index=True)

print("Q1 Sales Data (combined):")
print(df_q1)
print(f"\nTotal records: {len(df_q1)}")
print(f"\nTotal sales by product:")
print(df_q1.groupby('Product')['Sales'].sum())
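
With real files on disk, the same combination is often written as a loop over a glob pattern. The sketch below creates three tiny files in a temporary directory to stay self-contained; the file naming scheme `sales_2024_*.csv` and the `SourceFile` column are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Write three small monthly files (stand-ins for real exports)
    (folder / "sales_2024_01.csv").write_text("Date,Product,Sales\n2024-01-01,Widget A,100\n")
    (folder / "sales_2024_02.csv").write_text("Date,Product,Sales\n2024-02-01,Widget A,130\n")
    (folder / "sales_2024_03.csv").write_text("Date,Product,Sales\n2024-03-01,Widget A,160\n")

    frames = []
    # sorted() makes the row order deterministic across platforms
    for path in sorted(folder.glob("sales_2024_*.csv")):
        df_month = pd.read_csv(path, parse_dates=["Date"])
        df_month["SourceFile"] = path.name  # record which file each row came from
        frames.append(df_month)

    df_all = pd.concat(frames, ignore_index=True)

print(df_all)
```

Keeping the source file name on each row makes it easier to trace a bad value back to the export it came from.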

In [None]:
# Example 2: Merging data from different sources

# Employee data from CSV
employees_csv = """EmployeeID,Name,DepartmentID
1,John,101
2,Alice,102
3,Bob,101
4,Emma,103"""

# Department data from JSON
departments_json = '''
[
    {"DepartmentID": 101, "DepartmentName": "IT", "Budget": 500000},
    {"DepartmentID": 102, "DepartmentName": "HR", "Budget": 300000},
    {"DepartmentID": 103, "DepartmentName": "Finance", "Budget": 400000}
]
'''

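# Read both sources
# (wrap the JSON string in StringIO: recent pandas versions deprecate
# passing literal JSON text to read_json)
df_employees = pd.read_csv(io.StringIO(employees_csv))
df_departments = pd.read_json(io.StringIO(departments_json))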

print("Employees:")
print(df_employees)
print("\nDepartments:")
print(df_departments)

# Merge on DepartmentID
df_merged = pd.merge(df_employees, df_departments, on='DepartmentID', how='left')

print("\nMerged data:")
print(df_merged)
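
When keys come from different systems, a left merge can silently produce rows with no match. As a hedged sketch, `pd.merge` accepts `indicator=True` and `validate=...` to surface such problems; the frames below are simplified stand-ins, and department `999` is deliberately made up to have no counterpart.

```python
import pandas as pd

employees = pd.DataFrame({
    "EmployeeID": [1, 2, 3],
    "DepartmentID": [101, 102, 999],  # 999 has no matching department
})
departments = pd.DataFrame({
    "DepartmentID": [101, 102],
    "DepartmentName": ["IT", "HR"],
})

checked = pd.merge(
    employees,
    departments,
    on="DepartmentID",
    how="left",
    validate="many_to_one",  # raises MergeError if DepartmentID repeats on the right
    indicator=True,          # adds a _merge column: 'both' or 'left_only' here
)

# Rows that found no department
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["EmployeeID", "DepartmentID"]])
```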

In [None]:
# Example 3: ETL Pipeline (Extract, Transform, Load)

def etl_pipeline():
    """Complete ETL pipeline example"""
    
    # EXTRACT: Read from multiple sources
    print("EXTRACT PHASE:")
    
    # Source 1: CSV
    sales_csv = """OrderID,ProductID,Quantity,Price
1001,P1,5,100
1002,P2,3,200
1003,P1,2,100"""
    df_sales = pd.read_csv(io.StringIO(sales_csv))
    print("Sales data extracted")
    
    # Source 2: JSON (product info); StringIO avoids the deprecated literal-string input
    products_json = '[{"ProductID":"P1","Name":"Widget A"},{"ProductID":"P2","Name":"Widget B"}]'
    df_products = pd.read_json(io.StringIO(products_json))
    print("Product data extracted")
    
    # TRANSFORM: Clean and combine
    print("\nTRANSFORM PHASE:")
    
    # Calculate total
    df_sales['Total'] = df_sales['Quantity'] * df_sales['Price']
    
    # Merge with product info
    df_final = pd.merge(df_sales, df_products, on='ProductID')
    
    # Reorder columns
    df_final = df_final[['OrderID', 'ProductID', 'Name', 'Quantity', 'Price', 'Total']]
    
    print("Data transformed")
    
    # LOAD: Save to database
    print("\nLOAD PHASE:")
    conn = sqlite3.connect(':memory:')
    df_final.to_sql('order_details', conn, index=False, if_exists='replace')
    print("Data loaded to database")
    
    # Verify
    df_verify = pd.read_sql('SELECT * FROM order_details', conn)
    print("\nFinal result:")
    print(df_verify)
    
    conn.close()
    return df_final

result = etl_pipeline()

In [None]:
# Example 4: Reading and analyzing time series data

# Stock price data
stock_csv = """Date,Symbol,Open,High,Low,Close,Volume
2024-01-01,AAPL,180.50,185.20,179.80,184.00,50000000
2024-01-02,AAPL,184.50,186.30,183.20,185.50,52000000
2024-01-03,AAPL,185.00,187.50,184.00,186.80,48000000
2024-01-04,AAPL,187.00,188.20,185.50,187.30,51000000
2024-01-05,AAPL,187.50,189.00,186.80,188.50,53000000"""

# Read with date parsing and indexing
df_stock = pd.read_csv(
    io.StringIO(stock_csv),
    parse_dates=['Date'],
    index_col='Date'
)

print("Stock data:")
print(df_stock)

# Calculate daily returns
df_stock['Daily_Return'] = df_stock['Close'].pct_change() * 100

# Calculate moving average
df_stock['MA_3'] = df_stock['Close'].rolling(window=3).mean()

print("\nWith calculated metrics:")
print(df_stock[['Close', 'Daily_Return', 'MA_3']])

In [None]:
# Example 5: Data quality monitoring

def monitor_data_quality(df, report_name="Data Quality Report"):
    """Generate automated data quality monitoring report"""
    
    issues = []
    
    # Check 1: Missing values
    missing = df.isnull().sum()
    if missing.any():
        issues.append(f"Missing values detected in: {missing[missing > 0].to_dict()}")
    
    # Check 2: Duplicates
    dup_count = df.duplicated().sum()
    if dup_count > 0:
        issues.append(f"Found {dup_count} duplicate rows")
    
    # Check 3: Data types
    for col in df.columns:
        if df[col].dtype == 'object':
            # Check if numeric column stored as string
            try:
                pd.to_numeric(df[col])
                issues.append(f"Column '{col}' appears numeric but stored as text")
            except (ValueError, TypeError):
                # Column holds genuinely non-numeric text; nothing to report
                pass
    
    # Generate report
    print(f"\n{'='*60}")
    print(f"{report_name}")
    print(f"{'='*60}")
    
    if issues:
        print("\nISSUES FOUND:")
        for i, issue in enumerate(issues, 1):
            print(f"{i}. {issue}")
    else:
        print("\nNo issues found. Data quality is good!")
    
    print(f"\nSummary: {len(df)} rows, {len(df.columns)} columns")
    print(f"{'='*60}")

# Test with problematic data
test_data = pd.DataFrame({
    'A': [1, 2, 3, 3, None],
    'B': ['100', '200', '300', '400', '500'],  # Should be numeric
    'C': ['x', 'y', 'z', 'x', 'w']
})

monitor_data_quality(test_data, "Test Data Quality Check")

---

## 11. Summary <a id='summary'></a>

### Key Takeaways:

#### 1. **CSV Files** (Most Common):
- Use `pd.read_csv()` for reading
- Key parameters: `sep`, `header`, `usecols`, `dtype`, `parse_dates`
- Handle large files with `chunksize`
- Export with `to_csv()`

#### 2. **Excel Files** (Business Standard):
- Use `pd.read_excel()` for reading
- Can read multiple sheets with `sheet_name=None`
- Write with `ExcelWriter` for multiple sheets
- Requires the `openpyxl` library for `.xlsx` files

#### 3. **JSON Files** (Web/APIs):
- Use `pd.read_json()` with appropriate `orient`
- Flatten nested JSON with `json_normalize()`
- Common for API responses
- Export with `to_json()`
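
As a quick reminder of what flattening looks like, here is a minimal sketch with a made-up API response; the field names and the `sep="_"` choice are assumptions for the example.

```python
import pandas as pd

# Hypothetical nested API response
api_response = [
    {"id": 1, "user": {"name": "John", "address": {"city": "Boston"}}},
    {"id": 2, "user": {"name": "Alice", "address": {"city": "Denver"}}},
]

# Nested keys become column names joined by the separator
flat = pd.json_normalize(api_response, sep="_")

print(flat.columns.tolist())
print(flat)
```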

#### 4. **Databases** (Production Data):
- Use `pd.read_sql()` or `pd.read_sql_query()`
- SQLite support is built into Python (`sqlite3` module)
- Use SQLAlchemy for other databases
- Write with `to_sql()`

#### 5. **Web Sources**:
- `pd.read_html()` for HTML tables
- `requests` library for APIs
- BeautifulSoup for web scraping
- Can read CSV directly from URLs

#### 6. **Other Formats**:
- **Parquet**: Fast, compressed (big data)
- **HDF5**: Hierarchical data
- **Pickle**: Python objects (not portable)
- **Feather**: Fast data exchange

### Reading Functions Quick Reference:

| Format | Read Function | Write Function |
|--------|---------------|----------------|
| CSV | `pd.read_csv()` | `df.to_csv()` |
| Excel | `pd.read_excel()` | `df.to_excel()` |
| JSON | `pd.read_json()` | `df.to_json()` |
| SQL | `pd.read_sql()` | `df.to_sql()` |
| HTML | `pd.read_html()` | `df.to_html()` |
| Parquet | `pd.read_parquet()` | `df.to_parquet()` |
| Pickle | `pd.read_pickle()` | `df.to_pickle()` |
| Clipboard | `pd.read_clipboard()` | `df.to_clipboard()` |

### Best Practices Summary:

1. **Always Validate**:
   - Check shape, dtypes, missing values
   - Use `df.info()` and `df.describe()`
   - Verify data integrity

2. **Handle Encoding**:
   - Try UTF-8 first
   - Fall back to latin1/cp1252
   - Specify encoding explicitly

3. **Optimize Performance**:
   - Specify dtypes upfront
   - Use `usecols` for large files
   - Consider chunksize for very large datasets

4. **Data Quality**:
   - Check for missing values
   - Remove duplicates
   - Validate data ranges
   - Monitor data quality regularly

5. **Error Handling**:
   - Use try-except blocks
   - Validate file existence
   - Handle specific exceptions
   - Log errors for debugging

6. **Security**:
   - Use parameterized SQL queries
   - Don't use pickle from untrusted sources
   - Validate data before processing
   - Sanitize user inputs
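
To make the first security point concrete, the sketch below passes a value to `pd.read_sql` through a `?` placeholder instead of formatting it into the SQL string. It uses an in-memory SQLite database; the table contents and the hostile `user_input` string are invented for the demonstration.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "name": ["Widget A", "Widget B", "Gadget"],
    "price": [100, 200, 300],
}).to_sql("products", conn, index=False)

# Input that would be dangerous if pasted into the SQL text
user_input = "Widget A'; DROP TABLE products; --"

query = "SELECT * FROM products WHERE name = ?"  # ? is the sqlite3 placeholder style

# The value travels separately from the SQL, so it is only ever compared as data
df_none = pd.read_sql(query, conn, params=(user_input,))
df_hit = pd.read_sql(query, conn, params=("Widget A",))

# The table is untouched
still_there = int(pd.read_sql("SELECT COUNT(*) AS n FROM products", conn)["n"].iloc[0])
conn.close()

print(f"Rows for hostile input: {len(df_none)}")
print(f"Rows for 'Widget A': {len(df_hit)}")
print(f"Rows still in table: {still_there}")
```

Placeholder syntax depends on the database driver (`?` for `sqlite3`, other styles elsewhere), so it is worth checking the driver's documentation when switching databases.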

### Common Patterns:

```python
# Pattern 1: Read → Clean → Analyze → Export
df = pd.read_csv('data.csv')
df = df.dropna()  # Clean
result = df.groupby('category').sum(numeric_only=True)  # Analyze (numeric columns only)
result.to_excel('output.xlsx')  # Export

# Pattern 2: Multiple sources → Merge → Save
df1 = pd.read_csv('sales.csv')
df2 = pd.read_json('products.json')
merged = pd.merge(df1, df2, on='product_id')
merged.to_sql('sales_details', conn, index=False)  # conn: an open database connection

# Pattern 3: Large file → Process in chunks
for chunk in pd.read_csv('large.csv', chunksize=10000):
    result = process(chunk)  # Your processing logic (placeholder function)
    save(result)             # Save results (placeholder function)
```

### When to Use Which Format:

- **CSV**: Simple data, sharing with non-Python tools
- **Excel**: Business reports, formatted data
- **JSON**: Web APIs, nested structures
- **SQL**: Large datasets, multi-user access
- **Parquet**: Big data, data warehouses
- **Pickle**: Temporary Python storage (avoid for long-term)

### Remember:

Reading data is just the first step! Always:
1. Validate the data after import
2. Check for quality issues
3. Document your data sources
4. Version your data when possible
5. Consider data privacy and security

**Mastering data import is essential for any data science workflow!**