# Week 9 Lecture 2
## pandas
- [pandas](https://pandas.pydata.org/) is a Python library for working with DataFrames, the pandas equivalent of a Pyret Table

In [1]:
import pandas as pd

A Pyret table:
```arr
orders = table: date, dish, quantity, order_type
  row: "2023-07-01", "Pasta", 2, "dine-in"
  row: "2023-07-01", "Salad", 1, "takeout"
end
```
- The same data as a pandas DataFrame, with one additional row:

In [2]:
data = {
    'date': ['2023-07-01', '2023-07-01', '2023-07-02'],
    'dish': ['Pasta', 'Salad', 'Burger'],
    'quantity': [2, 1, 3],
    'order_type': ['dine-in', 'takeout', 'dine-in']
}

orders = pd.DataFrame(data)
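
A minimal sketch of how the dictionary above maps onto the DataFrame: each key becomes a column name and each list becomes that column's values. The `shape` and `columns` attributes used here are standard pandas, but this check is an illustration rather than part of the original lecture.

```python
import pandas as pd

data = {
    'date': ['2023-07-01', '2023-07-01', '2023-07-02'],
    'dish': ['Pasta', 'Salad', 'Burger'],
    'quantity': [2, 1, 3],
    'order_type': ['dine-in', 'takeout', 'dine-in'],
}
orders = pd.DataFrame(data)

# shape is a (rows, columns) tuple
n_rows, n_cols = orders.shape
# columns holds the column labels, taken from the dictionary keys
column_names = list(orders.columns)

print(n_rows, n_cols)
print(column_names)
```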

## Loading and Accessing Data
- pandas provides the `read_csv` function for loading CSV files into a DataFrame


Loading in Pyret
```arr
orders = load-table: date, dish, quantity, order_type
  source: csv-table-file("orders.csv", default-options)
end
```


In [3]:
orders = pd.read_csv("orders.csv")
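
To try `read_csv` without a file on disk, the sketch below reads CSV text from an in-memory buffer. The CSV contents are an assumption that mirrors the `orders` data above; the real `orders.csv` may differ.

```python
import io
import pandas as pd

# Hypothetical file contents; the first line supplies the column names
csv_text = """date,dish,quantity,order_type
2023-07-01,Pasta,2,dine-in
2023-07-01,Salad,1,takeout
2023-07-02,Burger,3,dine-in
"""

# read_csv accepts a path or any file-like object
orders = pd.read_csv(io.StringIO(csv_text))

print(orders)
print(orders.dtypes)  # quantity is parsed as an integer column
```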

- You can view the first five rows with the `head()` method and the last five with `tail()`

In [None]:
orders.head()

In [None]:
orders.tail()
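
Both methods also accept an optional row count. A small sketch, using a cut-down version of the `orders` data as an assumption:

```python
import pandas as pd

orders = pd.DataFrame({
    'dish': ['Pasta', 'Salad', 'Burger'],
    'quantity': [2, 1, 3],
})

first_two = orders.head(2)   # rows at positions 0 and 1
last_one = orders.tail(1)    # only the final row

print(first_two)
print(last_one)
```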

- Rows can be accessed by position using the `iloc` accessor with square brackets; positions start at 0

Pyret way:
```arr
orders.row-n(1)["dish"]
```

In [None]:
orders.iloc[1]

In [None]:
orders.iloc[1]["dish"]

- Extracting Columns (pandas returns a Series, a list-like column of values, rather than a plain list)

Pyret way:
```arr
quantities = orders.get-column("quantity")
```

In [5]:
quantities = orders['quantity']
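
If a plain Python list is needed, closer to what Pyret's `get-column` returns, the Series can be converted with `tolist()`. A short sketch, assuming the same `quantity` values as above:

```python
import pandas as pd

orders = pd.DataFrame({
    'dish': ['Pasta', 'Salad', 'Burger'],
    'quantity': [2, 1, 3],
})

quantities = orders['quantity']         # a pandas Series
quantity_list = quantities.tolist()     # a plain Python list

print(type(quantities).__name__)
print(quantity_list)
```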

- There are methods for computing statistics from a column

Pyret way:
```arr
mean(orders, "quantity")    # Direct table operation
sum(orders, "quantity")     # Direct table operation
```

In [None]:
orders['quantity'].mean() 

In [None]:
orders['quantity'].sum()
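
For several statistics at once, `describe()` returns a Series of summary values. This is a sketch on the same three `quantity` values, not something the lecture itself uses:

```python
import pandas as pd

orders = pd.DataFrame({'quantity': [2, 1, 3]})

# describe() reports count, mean, std, min, quartiles, and max
summary = orders['quantity'].describe()

print(summary)
print(summary['mean'], summary['50%'])  # '50%' is the median
```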

- You can get an array of the unique values in a column (not a Series) using the [unique](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html#pandas.Series.unique) method; values appear in order of first occurrence

In [None]:
orders['order_type'].unique()
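
A brief sketch of `unique()` next to the related `nunique()`, which only counts the distinct values. The data matches the `order_type` column defined earlier.

```python
import pandas as pd

orders = pd.DataFrame({
    'order_type': ['dine-in', 'takeout', 'dine-in'],
})

order_types = orders['order_type'].unique()    # distinct values, first occurrence first
n_types = orders['order_type'].nunique()       # how many distinct values there are

print(list(order_types))
print(n_types)
```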

## Class Exercises
### Creating and Loading DataFrames
- Create a DataFrame named `workouts` manually, with columns `activity` and `duration`. Make at least 5 rows.

In [6]:
import pandas as pd

data = {
    'activity': ['sport', 'weightlifting', 'cardio', 'swimming', 'yoga'],
    'duration (mins)': [100, 90, 50, 60, 70]
}

workouts = pd.DataFrame(data)

workouts.head()

|   | activity | duration (mins) |
|---|---|---|
| 0 | sport | 100 |
| 1 | weightlifting | 90 |
| 2 | cardio | 50 |
| 3 | swimming | 60 |
| 4 | yoga | 70 |


- Load the CSV from `photos.csv` into a DataFrame. Print the first 5 rows.

In [9]:
photos = pd.read_csv("photos.csv")

photos.head()

|   | Location | Subject | Date |
|---|---|---|---|
| 0 | Cairo, Egypt | Portrait | 2023-08-15 |
| 1 | London, UK | Mountain | 2024-09-14 |
| 2 | Rome, Italy | Food | 2021-10-14 |
| 3 | Yellowstone National Park, WY | Concert | 2024-05-31 |
| 4 | Tokyo, Japan | Food | 2024-01-16 |


### Accessing Data
- Get the second row from your `workouts` DataFrame (remember: Python uses 0-based indexing).

In [10]:
workouts.iloc[1]

activity           weightlifting
duration (mins)               90
Name: 1, dtype: object

- Extract the `activity` column and print all unique activity names.

In [11]:
workouts['activity'].unique()

array(['sport', 'weightlifting', 'cardio', 'swimming', 'yoga'],
      dtype=object)

- Get the duration value from the third workout (combining row and column access).

In [13]:
workouts.iloc[2]['duration (mins)']

np.int64(50)

- What happens if you try to access a row that doesn't exist? Try it and note the error.


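Accessing a row position that does not exist, e.g. `workouts.iloc[10]`, raises `IndexError: single positional indexer is out-of-bounds`.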

- What happens if you try to access a column that doesn't exist? Try it and note the error.


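Accessing a column that does not exist, e.g. `workouts['calories']`, raises a `KeyError` that names the missing column.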
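
Both failures can be reproduced safely with `try`/`except`. In this sketch the position `10` and the column name `calories` are hypothetical choices made only to trigger the errors.

```python
import pandas as pd

workouts = pd.DataFrame({
    'activity': ['sport', 'weightlifting', 'cardio', 'swimming', 'yoga'],
    'duration (mins)': [100, 90, 50, 60, 70],
})

try:
    workouts.iloc[10]            # only positions 0 through 4 exist
except IndexError as err:
    row_error = err
    print('IndexError:', err)

try:
    workouts['calories']         # hypothetical column that was never created
except KeyError as err:
    col_error = err
    print('KeyError:', err)
```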

### Extracting Columns & Statistics
- Extract the `duration` column from your `workouts` DataFrame and store it in a variable called `durations`.

In [15]:
durations = workouts['duration (mins)']

- Work with the `durations` Series to find: `.mean()`, `.sum()`, `.max()`, `.min()`.

In [21]:
print('mean: ', durations.mean())

print('sum: ', durations.sum())

print('max: ', durations.max())

print('min: ', durations.min())

mean:  74.0
sum:  370
max:  100
min:  50


- Calculate the `range` (difference between `max` and `min`) of workout durations.

In [22]:
# Named duration_range so that Python's built-in range() is not shadowed
duration_range = durations.max() - durations.min()

print(duration_range)

50


- For the photos dataset, extract a numeric column and calculate its median using `.median()`.

In [26]:
photos = pd.read_csv("photos.csv")
photos.head()

|   | Location | Subject | Date |
|---|---|---|---|
| 0 | Cairo, Egypt | Portrait | 2023-08-15 |
| 1 | London, UK | Mountain | 2024-09-14 |
| 2 | Rome, Italy | Food | 2021-10-14 |
| 3 | Yellowstone National Park, WY | Concert | 2024-05-31 |
| 4 | Tokyo, Japan | Food | 2024-01-16 |

The photos dataset has no numeric column, so the cells below count how often each `Subject` appears and take the median of those counts.


In [30]:
vals = photos["Subject"].value_counts()

print(vals)

Subject
Concert           11
Forest             9
Birthday party     9
Wedding            8
Mountain           7
Festival           7
Street art         7
Sunset             7
Food               6
City skyline       6
Portrait           5
Wildlife           5
Museum exhibit     5
Architecture       4
Beach              4
Name: count, dtype: int64


In [32]:
print(vals.median())

7.0
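
To see what this computes on a case small enough to check by hand, here is a sketch with a made-up list of subjects (not taken from `photos.csv`). `value_counts()` tallies each distinct value, and `median()` is then applied to those tallies.

```python
import pandas as pd

# Hypothetical data: 3 Food, 2 Concert, 1 Portrait
subjects = pd.Series(['Food', 'Food', 'Food', 'Concert', 'Concert', 'Portrait'])

counts = subjects.value_counts()   # one count per distinct subject, largest first
median_count = counts.median()     # median of the counts 3, 2, 1

print(counts)
print(median_count)
```

Note that this is the median number of photos per subject, not the median of a column in the original data.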
