## **Python Data Structures Practice**

We will practice with Python data structures and how they integrate with pandas data types, namely `DataFrame` and `Series`.


Let's use our good old `flights.csv` one more time.

In [None]:
# Import pandas and upload the flights.csv to your Colab
# Then, read the dataset into a pandas DataFrame

# Write your code here

#### **1. Lists:**

a. Create a list of all unique `origin` airports from the dataset.

In [None]:
# Write your code here

b. How many unique destinations (`dest`) are present in the dataset?

You can use the built-in `len()` function here to get the length of your list.
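The pattern behind both list questions can be sketched on a small made-up DataFrame. The column name `city` and its values are hypothetical and are not part of `flights.csv`; this is only one possible way to get a list of unique values and count them.

```python
import pandas as pd

# Made-up data; `city` is a hypothetical column, not one from flights.csv.
toy = pd.DataFrame({"city": ["Oslo", "Lima", "Oslo", "Cairo", "Lima"]})

# unique() keeps each distinct value once, in order of first appearance;
# tolist() turns the result into a plain Python list.
unique_cities = toy["city"].unique().tolist()

# len() is a built-in function that works on any list.
n_unique = len(unique_cities)

print(unique_cities)
print(n_unique)
```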

In [None]:
# Write your code here

#### **2. Dictionaries:**

a. Create a dictionary where the keys are the unique carriers (`carrier`) and the values are the average departure delays (`dep_delay`) for each carrier.

This question might be a little involved. Let's break it down into a few steps:

1) First, you will want to group the `df` by the unique values in the `carrier` column. This means that all rows with the same carrier will be considered as a single group.

2) After grouping the rows by carrier, you will select just the `dep_delay` column from each group. This results in a `SeriesGroupBy` object where each group (i.e., each unique carrier) has its own set of departure delay values.

3) Then, for each group of departure delay values (corresponding to each carrier), you will calculate the average (or mean) delay. This will result in a new Series object where the index is the unique carriers and the values are the average departure delays for each carrier.

4) Finally, you will convert the resulting `Series` into a dictionary using the `to_dict()` method that you have seen in the lecture slides.

The keys of the dictionary will be the unique carriers, and the values are the average departure delays for each carrier.
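The four steps above can be sketched on a tiny made-up DataFrame before you try them on the flights data. The columns `team` and `score` are hypothetical stand-ins, chosen only to illustrate the group, select, average, convert chain.

```python
import pandas as pd

# Made-up data; `team` and `score` are hypothetical column names.
toy = pd.DataFrame({
    "team": ["red", "blue", "red", "blue", "green"],
    "score": [10, 4, 20, 8, 7],
})

grouped = toy.groupby("team")        # step 1: one group per unique team
scores_by_team = grouped["score"]    # step 2: a SeriesGroupBy of scores
mean_scores = scores_by_team.mean()  # step 3: Series indexed by team
mean_dict = mean_scores.to_dict()    # step 4: plain Python dictionary

print(mean_dict)
```

In practice the four steps are usually chained into a single expression; they are split here only so each intermediate object can be inspected.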

In [None]:
# Write your code here

b. Using the dictionary from the previous question, find out which carrier has the highest average delay. The Python built-in function `max()` will be useful here.
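One detail worth knowing: calling `max()` on a dictionary by itself compares the keys, not the values. A minimal sketch of one way around this, using a made-up dictionary (the names and numbers are hypothetical), is to pass the dictionary's `get` method as the `key` argument:

```python
# Made-up dictionary; keys and values are hypothetical.
averages = {"red": 15.0, "blue": 6.0, "green": 7.0}

# key=averages.get tells max() to rank each key by its value.
best_key = max(averages, key=averages.get)
best_value = averages[best_key]

print(best_key, best_value)
```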

In [None]:
# Write your code here

#### **3. Tuples:**

a. Extract the year, month, and day of the 100th flight in the dataset as a tuple.
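As a hint, here is a sketch of pulling a few values out of one row as a tuple, using a made-up DataFrame with hypothetical columns `a`, `b`, and `c`. Keep in mind that positions are zero-based, so the 2nd row sits at position 1, and by the same logic the 100th row sits at position 99.

```python
import pandas as pd

# Made-up data; the column names are hypothetical.
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9],
})

# iloc selects by zero-based position: position 1 is the 2nd row.
second_row = toy.iloc[1]

# Pick the columns of interest, then convert them to a tuple.
pair = tuple(second_row[["a", "b"]].tolist())

print(pair)
```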

In [None]:
# Write your code here

b. From the dataset, create a tuple containing the details of the flight with the maximum air time.

In [None]:
# Write your code here

#### **4. Sets:**

a. Create a set of all unique flight numbers (`flight`) in the dataset.

In [None]:
# Write your code here

b. From the dataset, find the total number of unique tail numbers (`tailnum`).

In [None]:
# Write your code here

#### **5. Integration with Pandas:**

a. Using lists, find the top 5 most frequent destinations (`dest`).


Let's break this one down:

1) First, we need to select the `dest` column from `df`, since it contains the destination info.

2) Then, we can apply a very useful pandas method, `value_counts()`, that you have seen before to count the occurrences of each unique value in the Series `dest`. This will result in another Series where the index is the unique values from the original Series (`dest`), and the values are the counts of those unique values. This Series will be sorted in descending order by default.

3) How can we get the first few rows of this result? Wait! You know how! Call `head()` and pass it 5; since the result is already sorted in descending order, these are the five most frequent values.

4) Next, we need to access the index of the Series, which, in this context, contains the top 5 most frequent destination values. We can do this with `index`, an attribute that every pandas Series (and DataFrame) has and that gives access to its index labels.

5) Finally, we will convert what we have to a Python list, and you have learned how to do this in the lecture slides (now on Canvas).
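The same five steps can be sketched on a tiny made-up Series, here asking for the top 2 instead of the top 5. The column `fruit` is hypothetical, and `tolist()` in the last step is just one option (wrapping the index in `list(...)` also works); the lecture slides may show a different one.

```python
import pandas as pd

# Made-up data; `fruit` is a hypothetical column name.
toy = pd.DataFrame({
    "fruit": ["apple", "kiwi", "apple", "pear", "kiwi",
              "apple", "fig", "kiwi", "pear", "apple"],
})

col = toy["fruit"]           # step 1: select the column
counts = col.value_counts()  # step 2: counts, sorted in descending order
top = counts.head(2)         # step 3: keep the first 2 rows
labels = top.index           # step 4: the index holds the values themselves
top_list = labels.tolist()   # step 5: convert to a plain Python list

print(top_list)
```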

You can do it! Give it a try!

In [None]:
# Write your code here

b. Using dictionaries, map each `origin` airport to the count of flights departing from it.

In [None]:
# Write your code here

c. Using sets, find out how many unique carriers operate flights from each `origin`.
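One possible approach, sketched on made-up data with hypothetical columns `store` and `brand`, is to build a dictionary that maps each group to a set and let the set discard duplicates. This is only an illustration; a pandas `groupby()` would work as well.

```python
import pandas as pd

# Made-up data; `store` and `brand` are hypothetical column names.
toy = pd.DataFrame({
    "store": ["north", "north", "south", "north", "south"],
    "brand": ["x", "y", "x", "x", "x"],
})

# Map each store to the set of brands seen there; sets ignore repeats.
brands_by_store = {}
for store, brand in zip(toy["store"], toy["brand"]):
    brands_by_store.setdefault(store, set()).add(brand)

# The size of each set is the number of unique brands per store.
counts = {store: len(brands) for store, brands in brands_by_store.items()}

print(counts)
```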

In [None]:
# Write your code here

### **6. Bonus: Visualization**

a. Filter out flights that have missing values in any of the columns.

**Analysis and Visualization:**

a. Find the top 5 carriers (airlines) with the highest number of delayed flights, where a flight is considered delayed if its `dep_delay` is more than 15 minutes.

b. For these top 5 carriers, visualize the average departure delay using a bar plot. Each carrier should be represented as a separate bar in the plot.

Tips:

* For visualization, use the matplotlib library.
* Utilize pandas operations, such as `groupby()`, `sort_values()`, and `mean()` as/if needed.

In [None]:
# Write your code here



**Data Structures:**

a. Convert the list of top 5 carriers into a set named `top_carriers_set`.

In [None]:
# Write your code here

# Keep the following line as the last line of this cell. Observe the output after you complete the question and run this cell.
print(type(top_carriers_set))