# Case Study

## Re-cap

### 1 - Zip function

The zip function accepts an arbitrary number of iterables and returns an iterator of tuples.

```python
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
names = ['barton', 'stark', 'odinson', 'maximoff']
z = zip(avengers, names)
print(type(z))

<class 'zip'>

print(list(z))

[('hawkeye', 'barton'), ('iron man', 'stark'), ('thor', 'odinson'), ('quicksilver', 'maximoff')]
```
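
Two properties of `zip()` are easy to miss in the example above: the zip object can be consumed only once, and it stops at the shortest input. The following is a minimal sketch of both, using a deliberately shortened `names` list (an assumption made for illustration, not part of the original exercise).

```python
# Sample data based on the example above; names is deliberately one item short.
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
names = ['barton', 'stark', 'odinson']

z = zip(avengers, names)

# The first pass consumes the iterator; zip stops at the shortest input,
# so 'quicksilver' is silently dropped.
first_pass = list(z)
print(first_pass)

# A second pass over the same zip object finds nothing left to yield.
second_pass = list(z)
print(second_pass)

# zip(*pairs) transposes the tuples back into separate sequences.
heroes, surnames = zip(*first_pass)
print(heroes)
print(surnames)
```

If the zipped pairs are needed more than once, it is usually simplest to store `list(z)` and reuse that list.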

### 2 - Defining a function

A function header begins with the keyword `def`, followed by the function name, its parameters inside parentheses, and a colon. The function body follows: first the docstring, enclosed in triple quotation marks, then the code that performs the computation. The body usually closes with the keyword `return`, followed by the value or values to return.

```python
def raise_both(value1, value2):
    """Raise value1 to the power of value2 and vice versa."""
    new_value1 = value1 ** value2
    new_value2 = value2 ** value1
    new_tuple = (new_value1, new_value2)
    return new_tuple

print(raise_both(3, 4))

(81, 64)
```
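
Because the function returns a tuple, the caller can unpack the result into separate names. This short sketch restates `raise_both()` in a more compact form (an equivalent rewrite for illustration, not the course's version) and shows the unpacking and the stored docstring.

```python
def raise_both(value1, value2):
    """Raise value1 to the power of value2 and vice versa."""
    # Returning two comma-separated values builds a tuple implicitly.
    return value1 ** value2, value2 ** value1

# Unpack the returned tuple into two names in a single assignment.
a, b = raise_both(3, 4)
print(a, b)

# The docstring is stored on the function object and is what help() displays.
print(raise_both.__doc__)
```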

### 3 - List comprehensions

Comprehensions, in their most basic forms, are enclosed in square brackets and are structured as output expression for iterator variable in iterable. More advanced comprehensions can include conditionals on the output expression and/or conditionals on the iterable.

* Basic
    ```python
    [output expression for iterator variable in iterable]
    ```

* Advanced
    ```python
    [output expression +
    conditional on output for iterator variable in iterable +
    conditional on iterable]
    ```
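
The templates above can be made concrete with a few small cases. This is an illustrative sketch on made-up numbers: one basic comprehension, one with a conditional on the iterable (a filter), and one with a conditional on the output expression.

```python
numbers = range(10)

# Basic: output expression for iterator variable in iterable
squares = [n ** 2 for n in numbers]

# Conditional on the iterable: only even numbers are passed through
even_squares = [n ** 2 for n in numbers if n % 2 == 0]

# Conditional on the output: every item is kept, but the output varies
labels = ['even' if n % 2 == 0 else 'odd' for n in range(4)]

print(squares)
print(even_squares)
print(labels)
```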

__Zipping Dictionaries__

_Instructions_

* Create a zip object by calling `zip()` and passing to it `feature_names` and `row_vals`. Assign the result to `zipped_lists`.
* Create a dictionary from the `zipped_lists` zip object by calling `dict()` with `zipped_lists`. Assign the resulting dictionary to `rs_dict`.

In [11]:
feature_names = ['CountryName',
 'CountryCode',
 'IndicatorName',
 'IndicatorCode',
 'Year',
 'Value']
row_vals = ['Arab World',
 'ARB',
 'Adolescent fertility rate (births per 1,000 women ages 15-19)',
 'SP.ADO.TFRT',
 '1960',
 '133.56090740552298']

# Zip lists: zipped_lists
zipped_lists = zip(feature_names, row_vals)

# Create a dictionary: rs_dict
rs_dict = dict(zipped_lists)

# Print the dictionary
print(rs_dict)


{'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'Year': '1960', 'Value': '133.56090740552298'}


__Writing a function to help us__

In this exercise, we will create a function to house the code we wrote earlier, making it easier to reuse and much more concise. Why? This way, we only need to call the function and supply the appropriate lists to create our dictionaries!

_Instructions_
* Define the function `lists2dict()` with two parameters: `list1` and `list2`.
* Return the resulting dictionary `rs_dict` in `lists2dict()`.
* Call the `lists2dict()` function with the arguments `feature_names` and `row_vals`. Assign the result of the function call to `rs_fxn`.

In [22]:
# feature_names and row_vals have been preloaded from the previous exercise

def lists2dict(list1, list2):
    """Return a dictionary where list1 provides the keys and list2 provides the values."""

    # Zip lists: zipped_lists
    zipped_lists = zip(list1, list2)

    # Create a dictionary: rs_dict
    rs_dict = dict(zipped_lists)

    # Return the dictionary
    return rs_dict

# Call lists2dict: rs_fxn
rs_fxn = lists2dict(feature_names, row_vals)

# Print rs_fxn
print(rs_fxn)

{'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'Year': '1960', 'Value': '133.56090740552298'}


__Using a list comprehension__

This time, we're going to use the `lists2dict()` function we defined in the last exercise to turn a bunch of lists into a list of dictionaries with the help of a list comprehension.

The `lists2dict()` function has already been defined above.

Our goal is to use a list comprehension to generate a list of dicts, where the keys are the header names and the values are the row entries.

_Instructions_
* Inspect the contents of `row_lists` by printing the first two lists in `row_lists`.
* Create a list comprehension that generates a dictionary using `lists2dict()` for each sublist in `row_lists`. The keys are from the `feature_names` list and the values are the row entries in `row_lists`. Use `sublist` as your iterator variable and assign the resulting list of dictionaries to `list_of_dicts`.
* Look at the first two dictionaries in your new list of dictionaries.

In [25]:
import pandas as pd

df = pd.read_csv('../Databases/World-Development-Indicators/WDICSV_subset.csv')

feature_names = list(df.columns)  
row_lists = df.values.tolist()  

lists2dict(feature_names, row_lists[0])     # lists2dict is already defined in the previous exercise

# Print out first two lists in row_lists
print(row_lists[0])
print(row_lists[1], '\n')

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]

# Print the first two dictionaries in list_of_dicts
print(list_of_dicts[0])
print(list_of_dicts[1])


['Arab World', 'ARB', 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'SP.ADO.TFRT', 1960, 133.56090740552298]
['Arab World', 'ARB', 'Age dependency ratio (% of working-age population)', 'SP.POP.DPND', 1960, 87.7976011532547] 

{'Country Name': 'Arab World', 'Country Code': 'ARB', 'Indicator Name': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'Indicator Code': 'SP.ADO.TFRT', 'Year': 1960, 'Value': 133.56090740552298}
{'Country Name': 'Arab World', 'Country Code': 'ARB', 'Indicator Name': 'Age dependency ratio (% of working-age population)', 'Indicator Code': 'SP.POP.DPND', 'Year': 1960, 'Value': 87.7976011532547}
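
The comprehension above depends on the CSV file being available. As a self-contained sketch of the same pattern, the snippet below uses a shortened header list and two made-up rows (both are assumptions for illustration only).

```python
def lists2dict(list1, list2):
    """Return a dictionary where list1 provides the keys and list2 provides the values."""
    return dict(zip(list1, list2))

# Shortened header and made-up rows, standing in for the real dataset.
feature_names = ['CountryName', 'Year', 'Value']
row_lists = [
    ['Arab World', 1960, 1.5],
    ['Arab World', 1961, 2.5],
]

# One dictionary per row: the header supplies the keys, the row the values.
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]

print(list_of_dicts[0])
print(list_of_dicts[1])
```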


__Turning this all into a DataFrame__

We've zipped lists together, created a function to house our code, and even used the function in a list comprehension to generate a list of dictionaries.

We will now use all of these to convert the list of dictionaries into a pandas DataFrame. We will see how convenient it is to generate a DataFrame from dictionaries with the `DataFrame()` function from the pandas package.

The `lists2dict()` function, `feature_names` list, and `row_lists` list have been preloaded for this exercise.

_Instructions_
* To use the `DataFrame()` function, first import the pandas package with the alias `pd`.
* Create a DataFrame from the list of dictionaries in `list_of_dicts` by calling `pd.DataFrame()`. Assign the resulting DataFrame to `df`.
* Inspect the contents of `df` by printing the head of the DataFrame, which can be accessed by calling `df.head()`.

In [30]:
# pandas is imported as pd, list_of_dicts is available from the previous exercise

df = pd.DataFrame(list_of_dicts)

# Print the head of the DataFrame
print(df.head())

  Country Name Country Code                                     Indicator Name  Indicator Code  Year         Value
0   Arab World          ARB  Adolescent fertility rate (births per 1,000 wo...     SP.ADO.TFRT  1960  1.335609e+02
1   Arab World          ARB  Age dependency ratio (% of working-age populat...     SP.POP.DPND  1960  8.779760e+01
2   Arab World          ARB  Age dependency ratio, old (% of working-age po...  SP.POP.DPND.OL  1960  6.634579e+00
3   Arab World          ARB  Age dependency ratio, young (% of working-age ...  SP.POP.DPND.YG  1960  8.102333e+01
4   Arab World          ARB        Arms exports (SIPRI trend indicator values)  MS.MIL.XPRT.KD  1960  3.000000e+06
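
As a smaller illustration of the same constructor, the sketch below builds a DataFrame from two made-up records (hypothetical values, not taken from the dataset). The dictionary keys become the column names, and a key missing from one record shows up as a missing value in that row.

```python
import pandas as pd

# Made-up records; the second one deliberately lacks the 'Value' key.
records = [
    {'CountryName': 'Arab World', 'Year': 1960, 'Value': 1.0},
    {'CountryName': 'Arab World', 'Year': 1961},
]

df_small = pd.DataFrame(records)

# Keys become column names, in order of first appearance.
print(df_small.columns.tolist())

# The missing key is filled with NaN in the second row.
print(df_small['Value'].isna().tolist())
```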


## Using Python generators for streaming data

### Processing data in chunks


__The following code is an exception to the course procedure: to match the data used in DataCamp, the csv file `WDICSV.csv` has been modified and extracted as `WDICSV_world_dev_ind_datacamp.csv`. It still has much more data than the original `world_dev_ind.csv` file.
Skip the next code cell to avoid downloading the file again, and move on to the one after it.__

In [48]:
"""Only look at this cell if you want to see how the csv file explained above was modified."""
import pandas as pd

# Read the CSV file
df = pd.read_csv('../Databases/World-Development-Indicators/WDICSV.csv')

# Define the feature names
feature_names = ['CountryName', 'CountryCode', 'IndicatorName', 'IndicatorCode', 'Year', 'Value']

# Create the row_lists
row_lists = []
for index, row in df.iterrows():
    if pd.notna(row['1960']):
        row_list = [
            row['Country Name'],
            row['Country Code'],
            row['Indicator Name'],
            row['Indicator Code'],
            1960,
            row['1960']
        ]
        row_lists.append(row_list)

# Create a new DataFrame from the row_lists with the specified column names
filtered_df = pd.DataFrame(row_lists, columns=feature_names)

# Save the filtered DataFrame to a new CSV file. Uncomment the next line to save the file if it does not already exist in /Databases/World-Development-Indicators
# filtered_df.to_csv('../Databases/World-Development-Indicators/WDICSV_world_dev_ind_datacamp.csv', index=False)

### EXERCISE

The csv file `'WDICSV_world_dev_ind_datacamp.csv'` is in the directory for our use (check that it's there). To begin, we need to open a connection to this file using what is known as a context manager. For example, the command `with open('datacamp.csv') as datacamp:` binds the csv file `'datacamp.csv'` to the name `datacamp` in the context manager. Here, the `with` statement manages the context, and its purpose is to ensure that resources are allocated and released properly: the file is closed automatically when the block ends, even if an error occurs.
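
To see what the context manager does with the file, the following sketch writes a tiny throwaway csv into a temporary directory (the file contents are made up for illustration) and checks the file's `closed` attribute inside and after the `with` block.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small throwaway csv file to open.
    path = os.path.join(tmp_dir, 'datacamp.csv')
    with open(path, 'w') as out:
        out.write('CountryName,Value\nArab World,1\n')

    # Bind the file to the name datacamp for the duration of the block.
    with open(path) as datacamp:
        header = datacamp.readline()
        closed_inside = datacamp.closed

    # The file is closed as soon as the block ends.
    closed_after = datacamp.closed

print(header.strip())
print(closed_inside, closed_after)
```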

### Instructions

* Use `open()` to bind the csv file `'WDICSV_world_dev_ind_datacamp.csv'` as `file` in the context manager.
* Complete the `for` loop so that it iterates __1000__ times to perform the loop body and process only the first 1000 rows of data of the file.

In [46]:
import pandas as pd

# Read the CSV file
df = pd.read_csv('../Databases/World-Development-Indicators/WDICSV_world_dev_ind_datacamp.csv')


# Open a connection to the file
with open('../Databases/World-Development-Indicators/WDICSV_world_dev_ind_datacamp.csv') as file:

    # Skip the column names
    file.readline()

    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Process only the first 1000 rows
    for j in range(1000):     # csv file has over 35000 rows, but we only process the first 1000 rows for speed

        # Split the current line into a list: line
        line = file.readline().split(',')

        # Get the value for the first column: first_col
        first_col = line[0]

        # If the column value is in the dict, increment its value
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1

        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1

# Print the resulting dictionary
print(counts_dict)

{'Africa Eastern and Southern': 125, 'Africa Western and Central': 122, 'Arab World': 120, 'Caribbean small states': 118, 'Central Europe and the Baltics': 114, 'Early-demographic dividend': 152, 'East Asia & Pacific': 173, 'East Asia & Pacific (excluding high income)': 76}
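
The loop above splits each line on commas, which can misread a row whose first field is quoted and contains a comma. It also assumes the file has at least 1000 data rows, because `readline()` returns an empty string at the end of the file. A possible variant, sketched here on made-up in-memory data rather than the real file, uses the standard `csv` module and `itertools.islice`.

```python
import csv
import io
from itertools import islice

# Made-up sample standing in for the real csv file; note the quoted fields.
sample = io.StringIO(
    'CountryName,CountryCode,IndicatorName\n'
    '"Bahamas, The",BHS,Some indicator\n'
    'Arab World,ARB,"Adolescent fertility rate (births per 1,000 women ages 15-19)"\n'
    'Arab World,ARB,Another indicator\n'
)

reader = csv.reader(sample)

# Skip the column names
next(reader)

counts_dict = {}

# islice stops after 1000 rows, or earlier if the data runs out
for row in islice(reader, 1000):
    first_col = row[0]
    # dict.get supplies 0 for keys that have not been seen yet
    counts_dict[first_col] = counts_dict.get(first_col, 0) + 1

print(counts_dict)
# → {'Bahamas, The': 1, 'Arab World': 2}
```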


### Writing a generator to load data in chunks

In this case, it would be useful to use generators. Generators allow users to [lazily evaluate](https://www.blog.pythonlibrary.org/2014/01/27/python-201-an-intro-to-generators/) data. This concept of _lazy evaluation_ is useful when you have to deal with very large datasets because it lets you generate values in an efficient manner by yielding only chunks of data at a time instead of the whole thing at once.
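
A small sketch can make lazy evaluation visible. The hypothetical generator function `squares()` below records each input in a list at the moment it is processed, which shows that nothing is computed until `next()` asks for a value.

```python
def squares(limit, log):
    """Yield n ** 2 for n below limit, recording each n when it is processed."""
    for n in range(limit):
        log.append(n)
        yield n ** 2

log = []
gen = squares(1_000_000, log)

# Creating the generator object runs none of the function body.
nothing_computed_yet = (log == [])

# Each call to next() computes exactly one more value.
first = next(gen)
second = next(gen)
third = next(gen)

print(nothing_computed_yet)
print(first, second, third)
print(log)
```

Only three of the one million possible values were ever computed, which is the property that makes generators suitable for very large datasets.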

In this exercise, we will define a generator function `read_large_file()` that produces a generator object which yields a single line from a file each time `next()` is called on it. The csv file `'WDICSV_world_dev_ind_datacamp.csv'` is in the directory for our use.

Note that when you open a connection to a file, the resulting file object is already an iterator that yields one line at a time! So out in the wild, you won't have to explicitly create generator objects in cases such as this. But for pedagogical reasons, we practice how to do this here with the `read_large_file()` generator function.
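
As a quick check of the note above, the sketch below uses `io.StringIO` as an in-memory stand-in for an opened file (an assumption made so the example runs without a file on disk) and reads lines from it directly with `next()` and a `for`-style loop.

```python
import io

# In-memory stand-in for a file opened with open(); the contents are made up.
file = io.StringIO('header\nrow 1\nrow 2\n')

# A file object is its own iterator.
is_own_iterator = iter(file) is file

# next() reads one line at a time, just like the generator in this exercise.
first_line = next(file)

# A comprehension (or for loop) continues from where next() stopped.
remaining = [line.rstrip('\n') for line in file]

print(is_own_iterator)
print(repr(first_line))
print(remaining)
```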

### Instructions

* In the function `read_large_file()`, read a line from `file_object` by using the method `readline()`. Assign the result to `data`.
* In the function `read_large_file()`, yield the line read from the file, `data`.
* In the context manager, create a generator object `gen_file` by calling your generator function `read_large_file()` and passing `file` to it.
* Print the first three lines produced by the generator object `gen_file` using `next()`.

In [50]:
# Define read_large_file()
def read_large_file(file_object):
    """A generator function to read a large file lazily."""

    # Loop indefinitely until the end of the file
    while True:

        # Read a line from the file: data
        data = file_object.readline()

        # Break if this is the end of the file
        if not data:
            break

        # Yield the line of data
        yield data

# Open a connection to the file
with open('../Databases/World-Development-Indicators/WDICSV_world_dev_ind_datacamp.csv') as file:

    # Create a generator object for the file: gen_file
    gen_file = read_large_file(file)

    # Print the first three lines of the file
    print(next(gen_file))
    print(next(gen_file))
    print(next(gen_file))

CountryName,CountryCode,IndicatorName,IndicatorCode,Year,Value

Africa Eastern and Southern,AFE,"Adolescent fertility rate (births per 1,000 women ages 15-19)",SP.ADO.TFRT,1960,135.79329065051937

Africa Eastern and Southern,AFE,Age dependency ratio (% of working-age population),SP.POP.DPND,1960,88.96769659511115



Now let's use our generator function to process the World Bank dataset like we did previously. We will process the file line by line, to create a dictionary of the counts of how many times each country appears in a column in the dataset. For this exercise, however, we won't process just 1000 rows of data, we'll process the entire dataset!

The generator function `read_large_file()` and the csv file `'WDICSV_world_dev_ind_datacamp.csv'` are preloaded from the previous exercise and ready for our use.

### Instructions
* Bind the file `'WDICSV_world_dev_ind_datacamp.csv'` to `file` in the context manager with `open()`.
* Complete the for loop so that it iterates over the generator from the call to `read_large_file()` to process all the rows of the file.


In [54]:
# initialize an empty dictionary: counts_dict
counts_dict = {}

"""Using generators can be slower than reading in the entire file at once. In practice, for small files, it doesn't matter much.
For very large files, however, the difference in processing time can be substantial."""

# Open a connection to the file
with open('../Databases/World-Development-Indicators/WDICSV_world_dev_ind_datacamp.csv') as file:

    # Iterate over the generator from read_large_file()
    for line in read_large_file(file):
        row = line.split(',')
        first_col = row[0]

        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1
        else:
            counts_dict[first_col] = 1

# Print
print(counts_dict)

{'CountryName': 1, 'Africa Eastern and Southern': 125, 'Africa Western and Central': 122, 'Arab World': 120, 'Caribbean small states': 118, 'Central Europe and the Baltics': 114, 'Early-demographic dividend': 152, 'East Asia & Pacific': 173, 'East Asia & Pacific (excluding high income)': 169, 'East Asia & Pacific (IDA & IBRD countries)': 143, 'Euro area': 137, 'Europe & Central Asia': 153, 'Europe & Central Asia (excluding high income)': 145, 'Europe & Central Asia (IDA & IBRD countries)': 123, 'European Union': 137, 'Fragile and conflict affected situations': 126, 'Heavily indebted poor countries (HIPC)': 119, 'High income': 166, 'IBRD only': 151, 'IDA & IBRD total': 151, 'IDA blend': 125, 'IDA only': 116, 'IDA total': 119, 'Late-demographic dividend': 144, 'Latin America & Caribbean': 182, 'Latin America & Caribbean (excluding high income)': 183, 'Latin America & the Caribbean (IDA & IBRD countries)': 157, 'Least developed countries: UN classification': 116, 'Low & middle income': 17
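
The output above includes `'CountryName': 1` because the header line is counted like any other row. One possible variant, sketched here on made-up in-memory data, skips the header with `next()` and lets `collections.Counter` do the tallying; the generator function is repeated so the snippet is self-contained.

```python
import io
from collections import Counter

def read_large_file(file_object):
    """A generator function to read a large file lazily."""
    while True:
        data = file_object.readline()
        if not data:
            break
        yield data

# Made-up sample standing in for the real csv file.
sample = io.StringIO(
    'CountryName,Year\n'
    'Arab World,1960\n'
    'Arab World,1961\n'
    'Euro area,1960\n'
)

lines = read_large_file(sample)

# Skip the header line so it is not counted as a country.
next(lines)

# Counter consumes the generator expression one line at a time.
counts = Counter(line.split(',')[0] for line in lines)

print(counts)
```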

## Using pandas' read_csv iterator for streaming data

__The following code is an exception to the course procedure: to match the data used in DataCamp, the csv file `WDICSV.csv` has been modified and extracted as `WDICSV_ind_pop_datacamp.csv`. It has slightly more rows than the original `ind_pop.csv` file.
Skip the next code cell to avoid downloading the file again, and move on to the one after it.__

In [None]:
"""Only look at this cell if you want to see how the csv file explained above was modified."""
import pandas as pd

# Read the CSV file
df = pd.read_csv('../Databases/World-Development-Indicators/WDICSV.csv')

# Define the feature names
feature_names = ['CountryName', 'CountryCode', 'IndicatorName', 'IndicatorCode', 'Year', 'Value']

# Get all year columns (assuming they're numeric column names between 1960 and 2023)
year_columns = [str(year) for year in range(1960, 2024)]

# Create temporary list
temp_list = []

for index, row in df.iterrows():
    if row['Indicator Name'] == 'Urban population (% of total population)':
        for year in year_columns:
            if pd.notna(row[year]):
                temp_list.append([
                    row['Country Name'],
                    row['Country Code'],
                    row['Indicator Name'],
                    row['Indicator Code'],
                    int(year),
                    row[year]
                ])

# Create a new DataFrame from temp_list with the specified column names
filtered_df = pd.DataFrame(temp_list, columns=feature_names)

# Then sort the DataFrame by the 'Year' column
filtered_df = filtered_df.sort_values(['Year', 'CountryName'])

# Save the filtered DataFrame to a new CSV file. Uncomment the next line to save the file if it does not already exist in /Databases/World-Development-Indicators
# filtered_df.to_csv('../Databases/World-Development-Indicators/WDICSV_ind_pop_datacamp.csv', index=False)

In [88]:
# Import the pandas package
import pandas as pd

# Initialize reader object: df_reader
df_reader = pd.read_csv('../Databases/World-Development-Indicators/WDICSV_ind_pop_datacamp.csv', chunksize=10)

# Print two chunks
print(next(df_reader))
print(next(df_reader))

                   CountryName CountryCode                             IndicatorName      IndicatorCode  Year  Value
0                  Afghanistan         AFG  Urban population (% of total population)  SP.URB.TOTL.IN.ZS  1960   8.40
1  Africa Eastern and Southern         AFE  Urban population (% of total population)  SP.URB.TOTL.IN.ZS  1960  14.58
2   Africa Western and Central         AFW  Urban population (% of total population)  SP.URB.TOTL.IN.ZS  1960  14.71
3                      Albania         ALB  Urban population (% of total population)  SP.URB.TOTL.IN.ZS  1960  30.70
4                      Algeria         DZA  Urban population (% of total population)  SP.URB.TOTL.IN.ZS  1960  30.51
5               American Samoa         ASM  Urban population (% of total population)  SP.URB.TOTL.IN.ZS  1960  66.21
6                      Andorra         AND  Urban population (% of total population)  SP.URB.TOTL.IN.ZS  1960  58.45
7                       Angola         AGO  Urban population (% 
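
The cell above pulls two chunks with `next()`. The reader returned by `read_csv()` with `chunksize` can also be consumed in a `for` loop, which is a common way to aggregate over a whole file without loading it at once. The sketch below uses a made-up in-memory csv of 25 rows (placeholder values, not the real dataset).

```python
import io

import pandas as pd

# Made-up csv text: 25 data rows with placeholder codes and values.
csv_text = 'CountryCode,Year,Value\n' + ''.join(
    f'AAA,{1960 + i},{i}.0\n' for i in range(25)
)

# With chunksize set, read_csv returns an iterator of DataFrames.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=10)

chunk_sizes = []
total = 0.0

for chunk in reader:
    # Each chunk is an ordinary DataFrame with at most 10 rows.
    chunk_sizes.append(len(chunk))
    total += chunk['Value'].sum()

print(chunk_sizes)
print(total)
```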

In the previous exercise, we used `read_csv()` to read in DataFrame chunks from a large dataset. In this exercise, we will read in a file using a bigger DataFrame chunk size and then process the data from the first chunk.

To process the data, we will create another DataFrame composed of only the rows from a specific country. We will then zip together two of the columns from the new DataFrame, ~~`'Total Population'`~~ `Year` and ~~`'Urban population (% of total population)'`~~ `Value`. Finally, we will create a list of tuples from the zip object, where each tuple is composed of a value from each of the two columns mentioned.

`pandas` has been imported as pd.

### Instructions
* Use `pd.read_csv()` to read in the file in `../Databases/World-Development-Indicators/WDICSV_ind_pop_datacamp.csv` in chunks of size `1000`. Assign the result to `urb_pop_reader`.

* Get the first DataFrame chunk from the iterable `urb_pop_reader` and assign this to `df_urb_pop`.

* Select only the rows of `df_urb_pop` that have a `'CountryCode'` of `'CEB'`. To do this, compare whether `df_urb_pop['CountryCode']` is equal to `'CEB'` within the square brackets in `df_urb_pop[____]`.

* Using `zip()`, zip together the ~~`'Total Population'`~~ `Year` and ~~`'Urban population (% of total population)'`~~ `Value` columns of `df_pop_ceb`. Assign the resulting zip object to `pops`.

__Note:__ In the DataCamp course the columns were `'Total Population'` and `'Urban population (% of total population)'`. However, since we cannot access the original database, we will use the `Year` and `Value` columns instead.

In [94]:
# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('../Databases/World-Development-Indicators/WDICSV_ind_pop_datacamp.csv', chunksize=1000)

# Get the first DataFrame chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)

# Check out the head of the DataFrame
print(df_urb_pop.head())

# Check out specific country: df_pop_ceb
df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']

# Zip DataFrame columns of interest: pops
pops = zip(df_pop_ceb['Year'], df_pop_ceb['Value'])

# Turn zip object into list: pops_list
pops_list = list(pops)

# Print pops_list
print(pops_list)

                   CountryName CountryCode                             IndicatorName      IndicatorCode  Year  Value
0                  Afghanistan         AFG  Urban population (% of total population)  SP.URB.TOTL.IN.ZS  1960   8.40
1  Africa Eastern and Southern         AFE  Urban population (% of total population)  SP.URB.TOTL.IN.ZS  1960  14.58
2   Africa Western and Central         AFW  Urban population (% of total population)  SP.URB.TOTL.IN.ZS  1960  14.71
3                      Albania         ALB  Urban population (% of total population)  SP.URB.TOTL.IN.ZS  1960  30.70
4                      Algeria         DZA  Urban population (% of total population)  SP.URB.TOTL.IN.ZS  1960  30.51
[(1960, 44.50789271439007), (1961, 45.2073380737434), (1962, 45.8673685184926), (1963, 46.5349208922556)]
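
The filtering and zipping steps can be tried without the csv file. The sketch below rebuilds them on a three-row DataFrame with made-up values (hypothetical numbers, not from the dataset), with the boolean mask given its own name to show what the comparison produces.

```python
import pandas as pd

# Made-up stand-in for one chunk of the real data.
df_urb_pop = pd.DataFrame({
    'CountryCode': ['CEB', 'ARB', 'CEB'],
    'Year': [1960, 1960, 1961],
    'Value': [44.5, 31.25, 45.25],
})

# The comparison yields one True/False entry per row.
mask = df_urb_pop['CountryCode'] == 'CEB'

# Indexing with the mask keeps only the rows marked True.
df_pop_ceb = df_urb_pop[mask]

# Pair each year with its value, row by row.
pops_list = list(zip(df_pop_ceb['Year'], df_pop_ceb['Value']))

print(mask.tolist())
print(pops_list)
```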
