# Assignment 1
### Understanding Uncertainty
### Due 9/5

1. Create a new public repo on GitHub under your account. Include a README file.
2. Clone it to your machine. Put this file into that repo.
3. Use the following function to download the example data for the course:

In [1]:
def download_data(force=False):
    """Download and extract course data from Zenodo."""
    import urllib.request, zipfile, os
    
    zip_path = 'data.zip'
    data_dir = 'data'
    
    if not os.path.exists(zip_path) or force:
        print("Downloading course data")
        urllib.request.urlretrieve(
            'https://zenodo.org/records/16954427/files/data.zip?download=1',
            zip_path
        )
        print("Download complete")
    else:
        print("Download file already exists")
        
    if not os.path.exists(data_dir) or force:
        print("Extracting data files...")
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(data_dir)
        print("Data extracted")
    else:
        print("Data directory already exists")

download_data()

Downloading course data
Download complete
Extracting data files...
Data extracted


4. Open one of the datasets using Pandas:
    1. `ames_prices.csv`: Housing characteristics and prices
    2. `college_completion.csv`: Public, nonprofit, and for-profit educational institutions, graduation rates, and financial aid
    3. `ForeignGifts_edu.csv`: Monetary and in-kind transfers from foreign entities to U.S. educational institutions
    4. `iowa.csv`: Liquor sales in Iowa, at the transaction level
    5. `metabric.csv`: Cancer patient and outcome data
    6. `mn_police_use_of_force.csv`: Records of physical altercations between Minnesota police and private citizens
    7. `nhanes_data_17_18.csv`: National Health and Nutrition Examination Survey
    8. `tuna.csv`: Yellowfin Tuna Genome (I don't recommend this one; it's just a sequence of G, C, A, T)
    9. `va_procurement.csv`: Public spending by the state of Virginia

In [None]:
import pandas as pd
import numpy as np

fg = pd.read_csv("./data/ForeignGifts_edu.csv")
fg.head()

| | ID | OPEID | Institution Name | City | State | Foreign Gift Received Date | Foreign Gift Amount | Gift Type | Country of Giftor | Giftor Name |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 102000 | Jacksonville State University | Jacksonville | AL | 43738 | 250000 | Monetary Gift | CHINA | |
| 1 | 2 | 104700 | Troy University | Troy | AL | 43592 | 463657 | Contract | CHINA | Confucius Institute Headquarters |
| 2 | 3 | 105100 | University of Alabama | Tuscaloosa | AL | 43466 | 3649107 | Contract | ENGLAND | Springer Nature Customer Service Ce |
| 3 | 4 | 105100 | University of Alabama | Tuscaloosa | AL | 43472 | 1000 | Contract | SAUDI ARABIA | Saudi Arabia Education Mission |
| 4 | 5 | 105100 | University of Alabama | Tuscaloosa | AL | 43479 | 49476 | Contract | SAUDI ARABIA | Saudi Arabia Education Mission |


In [9]:
fg.columns

Index(['ID', 'OPEID', 'Institution Name', 'City', 'State',
       'Foreign Gift Received Date', 'Foreign Gift Amount', 'Gift Type',
       'Country of Giftor', 'Giftor Name'],
      dtype='object')

In [10]:
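# Note: without parentheses this displays the bound method's repr (the frame itself);
# fg.describe() would return summary statistics instead.
fg.describe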

<bound method NDFrame.describe of           ID    OPEID                     Institution Name          City  \
0          1   102000        Jacksonville State University  Jacksonville   
1          2   104700                      Troy University          Troy   
2          3   105100                University of Alabama    Tuscaloosa   
3          4   105100                University of Alabama    Tuscaloosa   
4          5   105100                University of Alabama    Tuscaloosa   
...      ...      ...                                  ...           ...   
28216  28217  4279700  Albert Einstein College of Medicine         Bronx   
28217  28218  4279700  Albert Einstein College of Medicine         Bronx   
28218  28219  4279700  Albert Einstein College of Medicine         Bronx   
28219  28220  4279700  Albert Einstein College of Medicine         Bronx   
28220  28221  4279700  Albert Einstein College of Medicine         Bronx   

      State  Foreign Gift Received Date  Foreign Gift

In [21]:
fg.dtypes

ID                             int64
OPEID                          int64
Institution Name              object
City                          object
State                         object
Foreign Gift Received Date     int64
Foreign Gift Amount            int64
Gift Type                     object
Country of Giftor             object
Giftor Name                   object
dtype: object
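
The `Foreign Gift Received Date` column is stored as `int64`, and values such as 43466 and 43738 look like Excel serial day counts rather than calendar dates. The sketch below assumes that reading is correct (the data file does not confirm it) and converts the integers to timestamps using Excel's conventional 1899-12-30 origin. The `sample` frame and the `received` column name are hypothetical stand-ins built from the first rows shown earlier.

```python
import pandas as pd

# Hypothetical stand-in using date values from the fg.head() output.
sample = pd.DataFrame({"Foreign Gift Received Date": [43738, 43592, 43466]})

# Assumption: the integers are Excel serial dates (days since 1899-12-30).
sample["received"] = pd.to_datetime(
    sample["Foreign Gift Received Date"], unit="D", origin="1899-12-30"
)
print(sample)
```

If the assumption holds, the same call on `fg` would make it possible to group gifts by year or month. If the converted dates look implausible, the column may use a different encoding.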

5. Pick two or three variables and briefly analyze them
    - Is it a categorical or numeric variable?
    - How many missing values are there? (`df['var'].isna()` and `np.sum()`)
    - If categorical, tabulate the values (`df['var'].value_counts()`) and if numeric, get a summary (`df['var'].describe()`)

In [26]:
# Foreign Gift Amount >> numeric variable
vec1 = fg['Foreign Gift Amount'].isna()
print("Missing entries for Foreign Gift Amt: ", np.sum(vec1))

# Country of Giftor >> categorical variable
vec2 = fg["Country of Giftor"].isna()
print("Missing entries for Country of Giftor: ", np.sum(vec2))

# Summarize Foreign Gift Amount (numeric, so describe() rather than a tabulation)
print("Tabulated Information re: Foreign Gift Amount")
print(fg["Foreign Gift Amount"].describe())

print("Tabulated Data re: Country of Giftor")
fg["Country of Giftor"].value_counts()

Missing entries for Foreign Gift Amt:  0
Missing entries for Country of Giftor:  0
Tabulated Information re: Foreign Gift Amount
count    2.822100e+04
mean     5.882327e+05
std      3.222011e+06
min     -5.377700e+05
25%      5.700000e+03
50%      9.461500e+04
75%      3.761420e+05
max      1.000000e+08
Name: Foreign Gift Amount, dtype: float64
Tabulated Data re: Country of Giftor


Country of Giftor
ENGLAND                  3655
CHINA                    2461
CANADA                   2344
JAPAN                    1896
SWITZERLAND              1676
                         ... 
NEPAL                       1
ST. KITTS-NEVIS             1
TUNISIA                     1
MARSHAL ISLANDS (THE)       1
ANTIGUA                     1
Name: count, Length: 155, dtype: int64
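
The three checks in step 5 (variable type, missing count, and a tabulation or summary) can be bundled into one helper. This is a minimal sketch, not part of the assignment: `summarize_variable` and the `toy` frame are hypothetical, and the toy values are made up to exercise both branches.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def summarize_variable(df, column):
    """Return (kind, missing count, summary) for one column."""
    series = df[column]
    kind = "numeric" if is_numeric_dtype(series) else "categorical"
    n_missing = int(series.isna().sum())
    # Numeric columns get describe(); everything else gets value_counts().
    summary = series.describe() if kind == "numeric" else series.value_counts()
    return kind, n_missing, summary

# Toy data (hypothetical values) with the same column names as the gifts file.
toy = pd.DataFrame({
    "Foreign Gift Amount": [250000, 1000, None],
    "Country of Giftor": ["CHINA", "ENGLAND", "CHINA"],
})
for col in toy.columns:
    kind, n_missing, summary = summarize_variable(toy, col)
    print(col, kind, n_missing)
```

One caveat: the dtype test is only a heuristic. Identifier columns such as `ID` or `OPEID` are stored as integers, so this helper would label them numeric even though they behave like categories.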

6. What are some questions and prediction tools you could create using these data? Who would the stakeholder be for that prediction tool? What practical or ethical questions would it create? What other data would you want that are not available in your data?

Questions:
- How much money is given to each state in total?
- How much money is given from each country?
- Combining those, how much money is each country giving to each state?
- What types of gifts is each country giving?
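
Each of these questions maps onto a standard pandas aggregation. The sketch below uses a small `toy` frame with made-up values and the column names reported by `fg.columns`; the variable names are illustrative, and the same calls should apply to `fg` directly.

```python
import pandas as pd

# Toy rows (hypothetical values) using the column names from fg.columns.
toy = pd.DataFrame({
    "State": ["AL", "AL", "NY", "NY"],
    "Country of Giftor": ["CHINA", "ENGLAND", "CHINA", "CHINA"],
    "Gift Type": ["Monetary Gift", "Contract", "Contract", "Contract"],
    "Foreign Gift Amount": [100, 200, 300, 400],
})

# Total amount received by each state.
by_state = toy.groupby("State")["Foreign Gift Amount"].sum()

# Total amount given by each country.
by_country = toy.groupby("Country of Giftor")["Foreign Gift Amount"].sum()

# Country-by-state totals; pairs with no gifts are filled with 0.
country_by_state = toy.pivot_table(
    index="Country of Giftor", columns="State",
    values="Foreign Gift Amount", aggfunc="sum", fill_value=0,
)

# Count of each gift type per country.
type_by_country = pd.crosstab(toy["Country of Giftor"], toy["Gift Type"])

print(by_state, by_country, country_by_state, type_by_country, sep="\n\n")
```

The real data include a negative minimum amount, so it may be worth deciding how to treat negative entries before interpreting the totals.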

Prediction Tools:
- Can a model predict the type of a foreign gift, given its country of origin, amount, and the name of the receiving institution?
- Can a model predict how much money in foreign gifts a country will spend in a year, given the existing dates, countries, and amounts?
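
Before building a real model, a simple baseline helps calibrate expectations. The sketch below reads the first prediction idea as predicting `Gift Type` (an assumption about the intent) and guesses each country's most common gift type, falling back to the overall most common type for unseen countries. The `train` rows are made up, and `predict_gift_type` is a hypothetical helper.

```python
import pandas as pd

# Toy rows (hypothetical values); column names follow fg.columns.
train = pd.DataFrame({
    "Country of Giftor": ["CHINA", "CHINA", "CHINA", "ENGLAND", "ENGLAND"],
    "Gift Type": ["Contract", "Contract", "Monetary Gift",
                  "Monetary Gift", "Monetary Gift"],
})

# Most frequent gift type per country, plus an overall fallback.
per_country_mode = train.groupby("Country of Giftor")["Gift Type"].agg(
    lambda s: s.value_counts().idxmax()
)
overall_mode = train["Gift Type"].value_counts().idxmax()

def predict_gift_type(country):
    """Baseline guess: the country's most common gift type, else the overall one."""
    return per_country_mode.get(country, overall_mode)

print(predict_gift_type("CHINA"))
print(predict_gift_type("JAPAN"))  # country absent from the toy data
```

Any model that also uses the amount and institution would need to beat this baseline on held-out rows to justify the extra complexity.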

Practical/Ethical Questions:
- How/why is money distributed across institutions and states in the way it is, according to the data?
- What are the trends in foreign gifts from countries over time, and how is that information associated with the diplomatic/foreign relations of those countries over time?
- Do any transactions appear anomalous or abnormally "routine"? Can these data be used to search for corruption among giftors or recipient institutions?

Other data I would want:
- Intended purpose of gift (e.g., Research, Manufacturing, Goodwill/Foreign Relations)
- Number of foreign nationals from each country at the recipient institutions. With counts of each country's nationals affiliated with an institution (students, employees, etc.), could a model predict the amount of money that institution receives in a given year?

7. Commit your work to the repo (`git commit -am 'Finish assignment'` at the command line, or use the Git panel in VS Code). Push your work back to GitHub and submit the link on Canvas in the assignment tab.