# MAS DSE 200: Homework 2 - Pandas

#### Tasks:

- Submission on Gradescope:
  - Submit this Jupyter notebook to "Homework 2"


---

Remember: when in doubt, read the documentation first.

Python - https://docs.python.org/3/

NumPy - https://numpy.org/doc/stable/

pandas - https://pandas.pydata.org/docs/

## Instructions

* You don’t need to explain your approach (unless specified), so please be concise in your submission.
* To obtain full marks for a question, both the answer and the code must be correct.
* Completely wrong (or missing) code with a correct answer will receive zero marks.
* Please code the solution in the space provided.

### Imports

Import necessary packages

In [34]:
!pip install pandas numpy requests pillow rasterio



In [35]:
import pandas as pd
import numpy as np
import requests
from pathlib import Path
from PIL import Image
from io import BytesIO
import rasterio
from rasterio.transform import from_bounds

## Part 1: Titanic

### Preliminaries

* Grab the dataset from `https://raw.githubusercontent.com/justmarkham/DAT8/master/data/titanic.csv` and store it in a pandas dataframe called `passengers`.

In [None]:
# YOUR CODE HERE
# Download the CSV and save it locally
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/titanic.csv'
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail on an HTTP error instead of saving an error page
with open('titanic.csv', 'wb') as f:
    f.write(response.content)


In [None]:
# open the csv file as a pandas dataframe
passengers = None

### 1: Get to know your data - **20 points**

**1.1**: Print the first 15 entries in the dataframe to see what the columns are and what some values will look like - **5 points**

In [None]:
# YOUR CODE HERE


**1.2**: Next, set the index of the dataframe to the `PassengerId` column, and print the first 10 elements again to ensure the change took place - **5 points**

In [None]:
# YOUR CODE HERE


**1.3**: How many samples are there in this dataset? - **5 points**

In [None]:
# YOUR CODE HERE


**1.4**: How many samples contain `null`/`NaN` in at least one of the columns? - **5 points**

In [None]:
# YOUR CODE HERE


### 2: Summary statistics - **30 points**

**2.1**: Print the `min`, `max`, `mean` and `median` of age and fare of all passengers - **10 points**

Hint - Look at [`DataFrame.agg`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html#pandas.DataFrame.agg)
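
As a minimal sketch of the `DataFrame.agg` pattern the hint points to, here it is on a small made-up table. The `toy` frame and its column names are hypothetical and are not part of the Titanic data.

```python
import pandas as pd

# Toy data (hypothetical, not the Titanic dataset)
toy = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [50.0, 60.0, 65.0, 85.0],
})

# Passing a list of function names applies each one to every selected column.
# The result has one row per statistic and one column per input column.
summary = toy[["height", "weight"]].agg(["min", "max", "mean", "median"])
print(summary)
```

The same call shape should carry over to any pair of numeric columns.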

In [None]:
# YOUR CODE HERE


**2.2**: What is the average ticket fare price for male vs female passengers on the Titanic? - **10 points**

Note - The output should only have `Sex` and `Fare`

*Hint* - Look at [`DataFrame.groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby)
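
One way to get a two-column result from `groupby` is sketched below on made-up data. The `toy` frame and the `team`, `score`, and `age` columns are hypothetical stand-ins, and `as_index=False` is just one of several ways to keep the group key as a regular column.

```python
import pandas as pd

# Toy data (hypothetical, not the Titanic dataset)
toy = pd.DataFrame({
    "team": ["a", "b", "a", "b", "a"],
    "score": [10.0, 20.0, 30.0, 60.0, 50.0],
    "age": [21, 35, 42, 28, 30],
})

# Select the column of interest after grouping so only it is aggregated.
# as_index=False keeps the group key as an ordinary column in the output.
mean_score = toy.groupby("team", as_index=False)["score"].mean()
print(mean_score)
```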

In [None]:
# YOUR CODE HERE


**2.3**: What is the mean age for each of the sex and cabin class combinations? - **10 points**

In [None]:
# YOUR CODE HERE


### 3:  Number of passengers in different classes - **20 points**

**3.1**: How many passengers are in each class according to this dataset? (Hint: `Pclass` represents the class of a passenger.) - **5 points**

In [None]:
# YOUR CODE HERE


**3.2**: How many passengers in 1st class (`Pclass = 1`) are women (`Sex = female`) above the age of 27? - **5 points**

In [None]:
# YOUR CODE HERE


**3.3**: What fraction of passengers from each class survived (`Survived = 1`)? - **10 points**

In [None]:
# YOUR CODE HERE


### 4:  Fares - **30 points**

**4.1**: How many different fares were charged on the Titanic based on the dataset? - **5 points**

In [None]:
# YOUR CODE HERE


**4.2**: Find the top 10 fares charged to passengers. **Report these fare values**, and then **calculate the total number of passengers** who paid one of these top 10 fare amounts - **10 points**

In [None]:
# YOUR CODE HERE

**4.3**: Create a new dataset, called `passengers_filtered`, that includes only entries of passengers who paid one of these top 10 fares. **Report the number of samples** in the original dataset and in the new dataset to ensure the desired effect took place - **10 points**

*Hint* - Check out the pandas Series method [`isin`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isin.html)
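
A small sketch of filtering rows with `Series.isin`, using made-up data. The `toy` frame, its columns, and the `wanted_prices` list are hypothetical and only illustrate the boolean-mask pattern.

```python
import pandas as pd

# Toy data (hypothetical, not the Titanic dataset)
toy = pd.DataFrame({
    "item": ["pen", "book", "lamp", "desk", "chair"],
    "price": [2.0, 15.0, 40.0, 150.0, 40.0],
})

# Hypothetical set of values to keep
wanted_prices = [40.0, 150.0]

# isin returns a boolean Series; indexing with it keeps the matching rows
mask = toy["price"].isin(wanted_prices)
toy_filtered = toy[mask]

# Compare row counts before and after filtering
print(len(toy), len(toy_filtered))
```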

In [None]:
# YOUR CODE HERE


### 5:  Ages - **30 points**

**5.1**: What was the minimum, maximum and average age of passengers on the Titanic? - **5 points**

In [None]:
# YOUR CODE HERE


**5.2**: How many passengers on the Titanic were within one standard deviation of the mean age calculated in **5.1**? - **10 points**

In [None]:
# YOUR CODE HERE


**5.3**: How many of the passengers found in **5.2** were females over the age of 25? - **5 points**

In [None]:
# YOUR CODE HERE


**5.4**: What are the 10 **most** common ages of passengers according to this dataset? - **5 points**

In [None]:
# YOUR CODE HERE


## Part 2: Beer Review - 20 points

Use the `beer_reviews` dataframe created for you

In [None]:
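import ast

# Each line of the file is a Python dict literal. ast.literal_eval parses it
# without executing arbitrary code, which eval on downloaded text would allow.
response = requests.get("https://jmcauley.ucsd.edu/cse255/data/beer/beer_50000.json")
response.raise_for_status()
reviews = [ast.literal_eval(line) for line in response.text.splitlines()]

beer_reviews = pd.DataFrame(reviews)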

**6.1**: Which are the top 15 beers with the highest average ratings (`review/overall`)? - **10 points**

In [None]:
# YOUR CODE HERE


**6.2**: Which of the following correlates most strongly with `review/overall`: `review/palate`, `review/taste`, `review/aroma`, or the length of `review/text` (number of words in the review text)? - **10 points**

NOTE - `review/text` is of type string while the other reviews are of type float. Use the length of `review/text` instead. You may need to create a new column in the data frame
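
The note above can be sketched on made-up data as follows. The `toy` frame and the `comment`, `rating`, `service`, and `comment_words` names are hypothetical, and splitting on whitespace is an assumed definition of a "word".

```python
import pandas as pd

# Toy data (hypothetical, not the beer reviews)
toy = pd.DataFrame({
    "comment": ["good", "very good indeed", "not great at all", "ok fine"],
    "rating": [3.0, 5.0, 2.0, 3.5],
    "service": [3.5, 4.5, 2.0, 3.0],
})

# New numeric column: number of whitespace-separated words per comment
toy["comment_words"] = toy["comment"].str.split().str.len()

# Pearson correlation of each candidate column with the target column
correlations = toy[["service", "comment_words"]].corrwith(toy["rating"])
print(correlations.sort_values(ascending=False))
```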

In [None]:
# YOUR CODE HERE


## Part 3: Geospatial Data - 10 points

In [None]:

data_dir = Path('fire_data')
data_dir.mkdir(exist_ok=True)

In [None]:
# Select a new area
# http://bboxfinder.com/#-117.500249,32.665157,-116.820470,33.012173
# Go to the above link and select an area of interest
# Example bbox for San Diego area: [-117.500249, 32.665157, -116.820470, 33.012173]
bbox = [-117.500249, 32.665157, -116.820470, 33.012173]  # Replace with your chosen area

print(f"Study Area Bounding Box: {bbox}")
print(f"Longitude: {bbox[0]:.2f} to {bbox[2]:.2f}")
print(f"Latitude: {bbox[1]:.2f} to {bbox[3]:.2f}")

In [None]:
def get_naip_imagery(bbox, output_path):
    """
    Fetch NAIP imagery - simple version
    Always gets 1024x1024 pixels, no calculations
    """
    base_url = "https://imagery.nationalmap.gov/arcgis/rest/services/USGSNAIPImagery/ImageServer/exportImage"

    params = {
        'bbox': f"{bbox[0]},{bbox[1]},{bbox[2]},{bbox[3]}",
        'bboxSR': '4326',
        'size': '1024, 1024',
        'imageSR': '4326',
        'format': 'tiff',  # Get TIFF to preserve all bands
        'pixelType': 'U16',
        'f': 'image'
    }

    print(f"Requesting NAIP imagery: 1024x1024 pixels")
    response = requests.get(base_url, params=params, timeout=120)

    if response.status_code == 200:
        # Save directly - no conversion needed
        with open(output_path, 'wb') as f:
            f.write(response.content)

        print(f"✓ NAIP imagery saved to {output_path}")
        return True
    else:
        print(f"Error fetching NAIP: {response.status_code}")
        return False


# Use it
naip_file = data_dir / 'naip_imagery.tif'
print("Downloading NAIP imagery...")
success = get_naip_imagery(bbox, naip_file)

if success:
    # Check what we got
    with rasterio.open(naip_file) as src:
        print(f"Bands: {src.count}")
        print(f"Size: {src.width}x{src.height}")

### Task - Geospatial Data Analysis - 10 points

Based on the NAIP imagery data you've downloaded, create your own analytics to explore the geospatial data.

**Requirements (10 points total):**
1. **Visualize the imagery** (3 points) - Display the RGB bands properly
2. **Identify color patterns** (3 points) - Create masks to identify 2 colors (e.g., green for vegetation, brown for bare soil) or any other colors of your choice
3. **Analyze spatial patterns** (4 points) - Use your color masks to calculate statistics (e.g., percentage of green vs brown areas, spatial distribution)

You should create 3 different visualizations or analyses demonstrating your understanding of geospatial data manipulation with pandas and rasterio.
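
As a rough sketch of the masking idea in requirements 2 and 3, the example below builds a tiny synthetic image instead of reading the downloaded file. The pixel values and the threshold rules are assumptions chosen for illustration and are not calibrated for NAIP data.

```python
import numpy as np

# Synthetic 4x4 RGB image with shape (bands, rows, cols), the band-first
# layout that rasterio's read() returns. Pixel values are hypothetical.
rgb = np.zeros((3, 4, 4), dtype=np.uint8)
rgb[:, :2, :] = np.array([40, 160, 50], dtype=np.uint8).reshape(3, 1, 1)   # greenish top half
rgb[:, 2:, :] = np.array([150, 100, 60], dtype=np.uint8).reshape(3, 1, 1)  # brownish bottom half

# Unpack the three bands along the first axis as wider integers
red, green, blue = rgb.astype(np.int32)

# Simple per-pixel rules (assumed for illustration, not calibrated)
green_mask = (green > red) & (green > blue)
brown_mask = (red > green) & (green > blue)

# The mean of a boolean mask is the fraction of pixels it selects
green_pct = 100.0 * green_mask.mean()
brown_pct = 100.0 * brown_mask.mean()
print(f"green: {green_pct:.1f}%, brown: {brown_pct:.1f}%")

# Plotting libraries usually expect (rows, cols, bands), so move bands last
rgb_for_plot = np.transpose(rgb, (1, 2, 0))
```

With real imagery, the thresholds would need tuning, and the band values may need rescaling before display.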

In [None]:
# Your code here