### **Export Data with Pandas**
Pandas is a widely used Python library for data manipulation and analysis. It helps you work with structured data (such as spreadsheets or database tables) efficiently and intuitively. If you are not a programmer, you can think of Pandas as a programmable spreadsheet: similar to Excel, but able to handle large datasets through code.

In [None]:
import pandas as pd
import numpy as np

"""
Practice Exercise: Pandas Basics
Libraries to install: pandas, numpy, openpyxl.

Complete each function below by following the TODO instructions.
Each function includes the objective of the task and the expected output.

Use the official Pandas documentation for reference.
https://pandas.pydata.org/docs/user_guide/index.html
"""

In [2]:
"""
Objective: Convert data list into a Pandas DataFrame
"""
import pandas as pd
name = ["John Doe", "Nadia", "Serena", "Tessa", "Una"]
age = [25, 31, 23, 17, 23]
city = ["New York", "London", "Paris", "Tokyo", "Sydney"]

# TODO: Pair the lists into a dictionary
# TODO: Create a dataframe from the dictionary
# TODO: Validate the dataframe output and the object type

# Create dictionary from lists
data_dict = {
    'name': name,
    'age': age, 
    'city': city
}

# Create DataFrame
df = pd.DataFrame(data_dict)

# Validate output
print("DataFrame:")
print(df)
print("\nDataFrame Type:", type(df))

DataFrame:
       name  age      city
0  John Doe   25  New York
1     Nadia   31    London
2    Serena   23     Paris
3     Tessa   17     Tokyo
4       Una   23    Sydney

DataFrame Type: <class 'pandas.core.frame.DataFrame'>


In [3]:
"""
Objective: Convert data dictionaries into a Pandas DataFrame
"""
dict_1 = {"name": "John Doe", "age": 25, "city": "New York"}
dict_2 = {"name": "Nadia", "age": 31, "city": "London"}
dict_3 = {"name": "Serena", "age": 23, "city": "Paris"}
dict_4 = {"name": "Tessa", "age": 17, "city": "Tokyo"}
dict_5 = {"name": "Una", "age": 23, "city": "Sydney"}

# TODO: Collect the dictionaries into a list
# TODO: Create a dataframe from the list
# TODO: Validate the dataframe output and the object type

# Create list of dictionaries
data_list = [dict_1, dict_2, dict_3, dict_4, dict_5]

# Create DataFrame from list of dictionaries
df = pd.DataFrame(data_list)

# Validate output
print("DataFrame:")
print(df)
print("\nDataFrame Type:", type(df))

DataFrame:
       name  age      city
0  John Doe   25  New York
1     Nadia   31    London
2    Serena   23     Paris
3     Tessa   17     Tokyo
4       Una   23    Sydney

DataFrame Type: <class 'pandas.core.frame.DataFrame'>


In [4]:
"""
Objective: Adding new columns to a Pandas DataFrame
"""
# TODO: Assign the new list to a dataframe column name that does not exist yet
# TODO: Validate the dataframe output

# Create new column data
is_married = [True, False, True, False, False]

# Add new column to existing DataFrame
df['is_married'] = is_married

# Validate output
print("Updated DataFrame with new column:")
print(df)

Updated DataFrame with new column:
       name  age      city  is_married
0  John Doe   25  New York        True
1     Nadia   31    London       False
2    Serena   23     Paris        True
3     Tessa   17     Tokyo       False
4       Una   23    Sydney       False


In [7]:
"""
Objective: Adding new rows to a Pandas DataFrame
"""
new_row = {"name": "Victoria", "age": 30, "city": "New York", "is_married": True}

# TODO: Create a dataframe from the dictionary
# TODO: Concatenate previous dataframe with new dataframe and ignore index
# TODO: Validate the dataframe output

import pandas as pd
# Create DataFrame from the new row
new_df = pd.DataFrame([new_row])

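# Concatenate with previous DataFrame and reset index.
# Note: df is reassigned here, so every re-run of this cell appends the row again
# (the recorded output shows three "Victoria" rows after repeated runs).
df = pd.concat([df, new_df], ignore_index=True)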

# Validate output
print("Updated DataFrame with new row:")
print(df)

Updated DataFrame with new row:
       name  age      city  is_married
0  John Doe   25  New York        True
1     Nadia   31    London       False
2    Serena   23     Paris        True
3     Tessa   17     Tokyo       False
4       Una   23    Sydney       False
5  Victoria   30  New York        True
6  Victoria   30  New York        True
7  Victoria   30  New York        True
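
The three identical "Victoria" rows in the output above are consistent with the cell having been run more than once, since each run appends the row again. A minimal sketch of one way to clean this up is shown here. It uses a shortened stand-in DataFrame named `people` (not the notebook's `df`) and assumes that fully identical rows are unwanted:

```python
import pandas as pd

# Stand-in data (a shortened version of the exercise DataFrame)
people = pd.DataFrame({
    "name": ["John Doe", "Nadia"],
    "age": [25, 31],
    "city": ["New York", "London"],
    "is_married": [True, False],
})
new_row = {"name": "Victoria", "age": 30, "city": "New York", "is_married": True}

# Simulate running the "add row" cell three times
for _ in range(3):
    people = pd.concat([people, pd.DataFrame([new_row])], ignore_index=True)

# Keep the first occurrence of each fully identical row and renumber the index
deduped = people.drop_duplicates(ignore_index=True)
print(deduped)
```

Note that `drop_duplicates` also removes rows that are legitimately identical, so it is only appropriate when every row is expected to be unique.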


In [8]:
""" 
Objective: Renaming columns
"""
# TODO: Create a dictionary of {old column: new column} in columns variable
# TODO: Use .rename(columns=columns) and assign columns variable as parameter
# TODO: Check the new renamed dataframe

# Create dictionary for column renaming
columns = {
    'name': 'full_name',
    'age': 'years_old',
    'city': 'location',
    'is_married': 'marital_status'
}

# Rename columns
df = df.rename(columns=columns)

# Check renamed DataFrame
print("DataFrame with renamed columns:")
print(df)

DataFrame with renamed columns:
  full_name  years_old  location  marital_status
0  John Doe         25  New York            True
1     Nadia         31    London           False
2    Serena         23     Paris            True
3     Tessa         17     Tokyo           False
4       Una         23    Sydney           False
5  Victoria         30  New York            True
6  Victoria         30  New York            True
7  Victoria         30  New York            True


In [9]:
"""
Objective: Export as CSV
"""
# TODO: Use .to_csv(filename) to export as csv file
# Export DataFrame to CSV; index=False leaves out the row index column
df.to_csv('people_data.csv', index=False)
print("Data has been exported to people_data.csv")

Data has been exported to people_data.csv


In [None]:
"""
Objective: Export as Excel without index
"""
# TODO: Use .to_excel(filename, index=False) to export as excel file without index
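
The Excel objective has no recorded solution in this notebook. The following is a minimal sketch of what it might look like, not the exercise's official answer. It uses a stand-in DataFrame named `people`, writes into a temporary folder, and uses the hypothetical file name `people_data.xlsx`. It also assumes an Excel writer engine such as openpyxl is installed, and reports the result if it is not:

```python
import os
import tempfile

import pandas as pd

# Stand-in data using the renamed columns from the exercise
people = pd.DataFrame({
    "full_name": ["John Doe", "Nadia"],
    "years_old": [25, 31],
    "location": ["New York", "London"],
})

# Write into a temporary folder so the sketch does not clutter the working directory
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "people_data.csv")
xlsx_path = os.path.join(out_dir, "people_data.xlsx")  # hypothetical file name

# index=False leaves out the 0, 1, 2, ... row labels in both formats
people.to_csv(csv_path, index=False)

try:
    people.to_excel(xlsx_path, index=False)
    excel_written = True
except ImportError:
    # Raised when no Excel writer engine (for example openpyxl) is installed
    excel_written = False

print("CSV written:", os.path.exists(csv_path))
print("Excel written:", excel_written)
```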


### **Reflection**
Is there any difference between data exported as CSV and data exported as Excel using Pandas?

Yes, there are several key differences between CSV and Excel files when using Pandas:

1. File Format

   - CSV (Comma-Separated Values) is a simple plain-text format
   - Excel (.xlsx) is a zipped, XML-based workbook format that can contain multiple sheets and formatting
2. Data Types

   - CSV stores every value as text; Pandas infers types when reading, so dates and values such as zero-padded IDs need explicit conversion
   - Excel stores cell types (numbers, dates, text) along with formulas and formatting
3. Features

   - CSV:
     - Simple and lightweight
     - Universal compatibility
     - Better for version control
     - Faster to read/write for large datasets
   - Excel:
     - Supports multiple sheets
     - Maintains formatting (colors, fonts, borders)
     - Can store formulas
     - Has cell merging and other advanced features
4. Storage Size

   - CSV files carry no metadata, so small files are typically smaller
   - Excel files include additional metadata and features, although .xlsx compression can offset this for larger datasets
5. Processing Speed

   - CSV files generally load faster in Pandas
   - Excel files take longer to process because the workbook structure must be parsed

For simple data storage and transfer, CSV is often preferred. For complex data presentation with formatting and multiple sheets, Excel is more suitable.
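
The data type point can be checked with a small round trip. This is a sketch under stated assumptions: the `zip_code` and `joined` columns are hypothetical additions for illustration and are not part of the exercise data. The CSV is kept in memory, so no file is written:

```python
from io import StringIO

import pandas as pd

# Stand-in data: zip_code and joined are hypothetical columns added for illustration
people = pd.DataFrame({
    "name": ["John Doe", "Nadia"],
    "zip_code": ["00123", "04567"],  # text with leading zeros
    "joined": pd.to_datetime(["2024-01-05", "2024-02-10"]),  # real datetimes
})

# Write to an in-memory CSV and read it straight back
csv_text = people.to_csv(index=False)
restored = pd.read_csv(StringIO(csv_text))
print(restored.dtypes)  # zip_code is inferred as an integer, joined is no longer a datetime

# Declaring the types on read recovers the original values
fixed = pd.read_csv(StringIO(csv_text), dtype={"zip_code": str}, parse_dates=["joined"])
print(fixed.dtypes)
```

The CSV text itself holds no type information, so the types have to be restated by whoever reads the file.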

### **Exploration**
Pandas has a .read_html() function that reads HTML content directly, or even a URL. Can pandas.read_html() alone replace Requests+BeautifulSoup?
Try it by scraping https://www.scrapingcourse.com/table-parsing, using requests to get the HTML and pandas to extract the tables from it.

Here's a comparison of both approaches:


In [10]:
# Approach 1: Using requests + pandas.read_html()
from io import StringIO

import requests
import pandas as pd

# Get HTML content using requests
url = "https://www.scrapingcourse.com/table-parsing"
response = requests.get(url)
html_content = response.text

# Parse tables using pandas; wrapping the string in StringIO avoids the
# deprecation warning for passing literal HTML to read_html()
tables = pd.read_html(StringIO(html_content))
print("Number of tables found:", len(tables))
print("\nFirst table using requests + pandas:")
print(tables[0])

# Approach 2: Direct URL with pandas.read_html()
# (in the recorded run, this request was rejected with HTTP 403)
tables_direct = pd.read_html(url)
print("\nFirst table using pandas directly:")
print(tables_direct[0])


Number of tables found: 1

First table using requests + pandas:
    Product ID                 Name       Category    Price In Stock
0            1               Laptop    Electronics  $999.99      Yes
1            2           Smartphone    Electronics  $599.99      Yes
2            3           Headphones          Audio  $149.99       No
3            4         Coffee Maker     Appliances   $79.99      Yes
4            5        Running Shoes         Sports   $89.99      Yes
5            6          Smart Watch    Electronics  $249.99      Yes
6            7              Blender     Appliances   $39.99       No
7            8             Yoga Mat         Sports   $29.99      Yes
8            9       Wireless Mouse    Electronics   $24.99      Yes
9           10            Desk Lamp           Home   $34.99      Yes
10          11     Portable Speaker          Audio   $79.99       No
11          12  Electric Toothbrush  Personal Care   $49.99      Yes
12          13             Backpack    

HTTPError: HTTP Error 403: Forbidden

While pd.read_html() can work directly with URLs (although the direct call above was rejected with HTTP 403), using Requests+BeautifulSoup is still preferable for web scraping because:

1. Flexibility

   - pd.read_html() only extracts HTML tables (<table> elements)
   - BeautifulSoup can extract any HTML element
2. Control

   - Requests allows control over headers, cookies, and sessions
   - Requests can handle authentication, redirects, and timeouts
3. Error Handling

   - Requests exposes the status code of every response, so errors can be handled explicitly
   - Failed requests can be retried
   - Different status codes can be handled differently
4. Pre-processing

   - The HTML can be modified before parsing
   - Data endpoints behind dynamic pages can be requested directly (neither Requests nor BeautifulSoup executes JavaScript)
   - Data can be cleaned before parsing

So while pd.read_html() is convenient for simple table extraction, Requests+BeautifulSoup offers more control and flexibility for complex web scraping tasks.
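
The two tools can also be combined, with Requests handling the HTTP side and pandas doing only the table parsing. The sketch below illustrates that split under stated assumptions: the helper names `fetch_html` and `parse_tables`, the `User-Agent` string, and the inline sample HTML are all hypothetical, and an HTML parser such as lxml is assumed to be installed. Sending a browser-like header may be accepted where a default client is rejected, but that is not guaranteed for any particular site:

```python
from io import StringIO

import pandas as pd
import requests

# Hypothetical header value; some sites reject clients that send a default User-Agent
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; table-parsing-exercise)"}


def fetch_html(url, timeout=10):
    """Download a page with explicit headers and a timeout; raise on 4xx/5xx."""
    response = requests.get(url, headers=BROWSER_HEADERS, timeout=timeout)
    response.raise_for_status()
    return response.text


def parse_tables(html):
    """Return every <table> in the HTML string as a list of DataFrames."""
    return pd.read_html(StringIO(html))


# Inline sample so the sketch runs without network access
sample_html = """
<table>
  <thead><tr><th>Product ID</th><th>Name</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Laptop</td><td>$999.99</td></tr>
    <tr><td>2</td><td>Smartphone</td><td>$599.99</td></tr>
  </tbody>
</table>
"""
tables = parse_tables(sample_html)
print(tables[0])

# With network access, the two helpers would be chained like this:
# tables = parse_tables(fetch_html("https://www.scrapingcourse.com/table-parsing"))
```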