# Assignment 1
### Understanding Uncertainty
### Due 9/5
### Seth Spire

1. Create a new public repo on GitHub under your account. Include a README file.
2. Clone it to your machine. Put this file into that repo.
3. Use the following function to download the example data for the course:

In [1]:
def download_data(force=False):
    """Download and extract course data from Zenodo."""
    import urllib.request, zipfile, os
    
    zip_path = 'data.zip'
    data_dir = 'data'
    
    if not os.path.exists(zip_path) or force:
        print("Downloading course data")
        urllib.request.urlretrieve(
            'https://zenodo.org/records/16954427/files/data.zip?download=1',
            zip_path
        )
        print("Download complete")
    else:
        print("Download file already exists")
        
    if not os.path.exists(data_dir) or force:
        print("Extracting data files...")
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(data_dir)
        print("Data extracted")
    else:
        print("Data directory already exists")

download_data()

Download file already exists
Data directory already exists


4. Open one of the datasets using Pandas:
    1. `ames_prices.csv`: Housing characteristics and prices
    2. `college_completion.csv`: Public, nonprofit, and for-profit educational institutions, graduation rates, and financial aid
    3. `ForeignGifts_edu.csv`: Monetary and in-kind transfers from foreign entities to U.S. educational institutions
    4. `iowa.csv`: Liquor sales in Iowa, at the transaction level
    5. `metabric.csv`: Cancer patient and outcome data
    6. `mn_police_use_of_force.csv`: Records of physical altercations between Minnesota police and private citizens
    7. `nhanes_data_17_18.csv`: National Health and Nutrition Examination Survey
    8. `tuna.csv`: Yellowfin Tuna Genome (I don't recommend this one; it's just a sequence of G, C, A, T)
    9. `va_procurement.csv`: Public spending by the state of Virginia

In [2]:
import pandas as pd
foreign_gifts = pd.read_csv('data/ForeignGifts_edu.csv')

5. Pick two or three variables and briefly analyze them
    - Is it a categorical or numeric variable?
    - How many missing values are there? (`df['var'].isna()` and `np.sum()`)
    - If categorical, tabulate the values (`df['var'].value_counts()`) and if numeric, get a summary (`df['var'].describe()`)

In [3]:
pd.set_option('display.float_format', '{:.2f}'.format)
print("missing values:", foreign_gifts['Foreign Gift Amount'].isna().sum())
print("rows with negative amounts:", len(foreign_gifts[foreign_gifts['Foreign Gift Amount'] < 0]))
foreign_gifts['Foreign Gift Amount'].describe()



missing values: 0
rows with negative amounts: 24


count      28221.00
mean      588232.72
std      3222011.43
min      -537770.00
25%         5700.00
50%        94615.00
75%       376142.00
max     99999999.00
Name: Foreign Gift Amount, dtype: float64

> The variable `Foreign Gift Amount` is numeric. It has no missing values, but 24 values are negative, which at first looks like a data error. However, the Federal Student Aid website (https://fsapartners.ed.gov/knowledge-center/topics/section-117-foreign-gift-and-contract-reporting/section-117-foreign-gift-and-contract-data#) states, "Any negative amounts reported in the spreadsheets indicate an institutionally reported adjustment to a previously reported amount." So the negative values are valid, but they represent corrections to earlier reports rather than gifts. The median is about $95k while the mean is about $588k, so the distribution is right-skewed: a small number of very large values (such as the maximum of nearly $100 million) pull the mean well above the median.
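
> The reasoning above (set the adjustments aside, then compare the mean with the median) can be sketched on a small example. The amounts below are made-up placeholders, not rows from `ForeignGifts_edu.csv`, and treating every negative row as an adjustment is an assumption based on the FSA note.

```python
import pandas as pd

# Hypothetical toy amounts, NOT the real data: a few typical gifts,
# one negative adjustment, and one very large contract.
amounts = pd.Series([5_000, 20_000, 95_000, 150_000, 400_000, -10_000, 50_000_000])

# Negative rows are adjustments to earlier reports, so look at them separately.
adjustments = amounts[amounts < 0]
gifts = amounts[amounts >= 0]

mean_amount = gifts.mean()
median_amount = gifts.median()
skewness = gifts.skew()  # sample skewness; a positive value means a long right tail

print(f"adjustment rows: {len(adjustments)}")
print(f"mean: {mean_amount:,.0f}  median: {median_amount:,.0f}")
print(f"skewness: {skewness:.2f}")
```

> A mean far above the median together with positive skewness is the same pattern seen in the `describe()` output for the real column.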

In [4]:
print("missing values:", foreign_gifts['Gift Type'].isna().sum())
foreign_gifts['Gift Type'].value_counts()

missing values: 0


Gift Type
Contract         17274
Monetary Gift    10936
Real Estate         11
Name: count, dtype: int64

> The variable `Gift Type` is categorical. It has no missing values and has 3 categories: Contract, Monetary Gift, and Real Estate. Contracts are the majority at approximately 61%, and Monetary Gifts make up most of the rest at about 39%. Real Estate is rare, with just 11 of the more than 28,000 rows.
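
> The category shares can be checked directly from the counts printed above. This is a minimal sketch that rebuilds the counts by hand; the names `gift_type_counts` and `gift_type_share` are only for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
gift_type_counts = pd.Series(
    {"Contract": 17274, "Monetary Gift": 10936, "Real Estate": 11},
    name="count",
)

# Share of all rows in each category, as a percentage.
gift_type_share = (gift_type_counts / gift_type_counts.sum() * 100).round(2)
print(gift_type_share)
```

> On the full DataFrame, `foreign_gifts['Gift Type'].value_counts(normalize=True)` should give the same proportions without rebuilding the counts.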

In [5]:
print("missing values:", foreign_gifts['Country of Giftor'].isna().sum())
pd.set_option('display.max_rows', None)
foreign_gifts['Country of Giftor'].value_counts()

missing values: 0


Country of Giftor
ENGLAND                     3655
CHINA                       2461
CANADA                      2344
JAPAN                       1896
SWITZERLAND                 1676
SAUDI ARABIA                1610
FRANCE                      1437
GERMANY                     1394
HONG KONG                   1080
SOUTH KOREA                  811
QATAR                        693
THE NETHERLANDS              512
KOREA                        452
INDIA                        434
TAIWAN                       381
ISRAEL                       373
UNITED ARAB EMIRATES         365
SINGAPORE                    360
AUSTRALIA                    335
ITALY                        318
KUWAIT                       302
DENMARK                      292
SWEDEN                       275
BRAZIL                       256
SPAIN                        240
NORWAY                       240
IRELAND                      224
MEXICO                       221
INDONESIA                    184
NIGERIA                  

> The variable `Country of Giftor` is categorical. There are no missing values. There are 155 countries represented in the data as having given gifts (or contracts) to US higher education. England has given the most gifts by far, followed by China and Canada. Qatar is fairly high on the list despite being one of the smaller countries in the world, though it is also a very wealthy one.
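
> The table lists both `SOUTH KOREA` and `KOREA`, which may be the same country under two labels, so the count of distinct countries could be slightly inflated. Below is a minimal sketch of how such labels could be merged before counting. The toy column and the `label_fixes` mapping are hypothetical, and whether `KOREA` really means South Korea is an assumption that would need checking against the source.

```python
import pandas as pd

# Hypothetical toy column, NOT the real data.
countries = pd.Series(
    ["ENGLAND", "KOREA", "SOUTH KOREA", "QATAR", "ENGLAND", "SOUTH KOREA"]
)

print("distinct labels before:", countries.nunique())

# Assumption: "KOREA" refers to South Korea; confirm before merging real rows.
label_fixes = {"KOREA": "SOUTH KOREA"}
cleaned = countries.replace(label_fixes)

print("distinct labels after:", cleaned.nunique())
print(cleaned.value_counts())
```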

6. What are some questions and prediction tools you could create using these data? Who would the stakeholder be for that prediction tool? What practical or ethical questions would it create? What other data would you want, that are not available in your data?

> This data set does include dates for the gifts (though they appear to be stored as Excel serial numbers, i.e. the number of days since January 1, 1900), so I would be interested in doing some time series analysis: did events like the COVID-19 pandemic or the 2008 financial crisis affect the number and total value of gifts? The same data could also be used to model and predict future gifts. I would also be interested in where the biggest flows of money are going: which countries or giftors have the strongest relationships with a specific university.
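
> If the dates really are Excel day counts, they can be converted with `pd.to_datetime`. This is a sketch under that assumption; the serial numbers below are made-up examples, not values from the data set, and the name of the date column in the real file is not shown here.

```python
import pandas as pd

# Hypothetical serial numbers, NOT taken from the data set.
serials = pd.Series([39448, 43831, 44197])

# Excel's 1900 date system wrongly treats 1900 as a leap year, so for any
# date after February 1900 the effective origin is 1899-12-30.
dates = pd.to_datetime(serials, unit="D", origin="1899-12-30")
print(dates)

# Once converted, counting gifts per calendar year is a one-liner.
gifts_per_year = dates.dt.year.value_counts().sort_index()
print(gifts_per_year)
```

> With real dates in place, the yearly counts and totals could be compared across periods such as 2008 or 2020.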

> For much of this, the stakeholder would be leadership of universities (president, board of trustees, etc.) who depend on these gifts and want to know what their future projections look like, amongst many other things.

> Practically, there may be concerns about how stable and predictive a model of future gifts could be. Gifts may depend heavily on the specific needs of a given time (for example, funding for COVID-19 research could have given universities a temporary increase that is not indicative of future giving). Ethically, university leadership would likely use such a tool to try to maximize total gifts received, which means they could choose to target certain countries or giftors. That may raise moral or ethical dilemmas depending on who is giving the money (for example, just because UVA could get rich taking money from ISIS, they probably shouldn't).

> I would love to know what sort of projects the universities are using the money for. For example, what is Cornell using the six $100 million contracts from Qatar for? I would also like to combine this with some data about the universities, such as their endowment size and student enrollment.
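
> Combining the gifts with university-level data would mostly be a join on the institution. The sketch below uses two hypothetical toy tables; the column names (`institution`, `amount`, `enrollment`) and all values are placeholders, and real institution names would likely need cleaning before they match across sources.

```python
import pandas as pd

# Hypothetical toy tables; column names and values are placeholders.
gifts = pd.DataFrame({
    "institution": ["University A", "University A", "University B"],
    "amount": [100_000.0, 250_000.0, 50_000.0],
})
universities = pd.DataFrame({
    "institution": ["University A", "University B"],
    "enrollment": [20_000, 5_000],
})

# Total gifts per institution, then attach enrollment.
totals = gifts.groupby("institution", as_index=False)["amount"].sum()
merged = totals.merge(
    universities, on="institution", how="left", validate="one_to_one"
)

# Scale by size so large and small schools can be compared.
merged["amount_per_student"] = merged["amount"] / merged["enrollment"]
print(merged)
```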

7. Commit your work to the repo (`git commit -am 'Finish assignment'` at the command line, or use the Git panel in VS Code). Push your work back to GitHub and submit the link on Canvas in the assignment tab.