<h2> Working with real-world data </h2>

<h3> Introduction </h3>

Now that we've introduced the big computational tools, it's time to start using them to analyze some real-world data sets. In this week's notebook you'll compute some summary statistics for a data set, see how they differ, and think about what each one measures.

This week, we'll work with GDP (gross domestic product) data, which gives a measurement of the total economic activity of a region. The source we'll use is [Our World in Data](https://ourworldindata.org/grapher/national-gdp-constant-usd-wb?tab=table&time=latest), which also has [really nice interactive charts](https://ourworldindata.org/grapher/national-gdp-constant-usd-wb). I've cleaned the data set very slightly to remove empty data; the raw data is in `gdp_data.csv` and we can import it into Python:


In [71]:
# Open the file and import the data
with open('gdp_data.csv') as f:
    raw_data = f.readlines()

# Reformat into a usable thing. Format is {country: GDP}
data = {}
for row in raw_data[1:]:
    country = row.split(',')[0]
    # This line looks scary, but it just splits the row on commas, takes the second field,
    # drops its last character (normally the trailing newline), and converts the result to a number.
    # Feel free to ignore what it's doing!
    gdp = float(row.split(',')[1][:-1])
    
    data[country] = gdp
    
countries = list(data.keys())
gdps = list(data.values())
sorted_gdps = sorted(gdps)

# Uncomment this line to print the raw data:
# print(data)

# Uncomment this line to print the list of countries:
# print(countries)
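
If you're curious what the parsing line in the loop does, here is a minimal sketch that breaks it into steps. The sample row is hypothetical: it assumes each line of the file looks like `name,number` followed by a newline, which may not match the real file exactly.

```python
# Hypothetical sample row; the real file's exact layout is an assumption here
sample_row = "Nauru,92009870\n"

# Step 1: split the row into fields at each comma
fields = sample_row.split(',')

# Step 2: the first field is the region name
country = fields[0]

# Step 3: the second field still ends with a newline, so drop its last character
gdp_text = fields[1][:-1]

# Step 4: convert the remaining text to a number
gdp = float(gdp_text)

print(country, gdp)
```

The one-line version in the cell above does steps 1, 3, and 4 in a single expression.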

For example, the GDP of the US in 2021 was estimated at about 20.5 trillion dollars, while the GDP of Nauru was about 92 million dollars:

In [56]:
print(data['United States'])
print(data['Nauru'])

20529460000000.0
92009870.0


There are 205 regions listed in the data set; many are countries, but some are regions within countries (such as Greenland). Their economies range in size from China and the United States (each with a GDP above 15 trillion dollars) to Tuvalu, Nauru, and Kiribati (all under 200 million dollars). The list `sorted_gdps` contains all the GDPs in increasing order.
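
A sorted list of values loses the region names, so it can help to sort the names themselves by GDP. The sketch below shows one way to do that on a small, made-up dictionary with the same `{region: GDP}` shape as `data`; the region names and numbers are hypothetical and not taken from the file.

```python
# Hypothetical toy data in the same {region: GDP} shape as `data`
toy_data = {"Midland": 3.0e10, "Bigland": 2.0e13, "Smallland": 5.0e7}

# Values only, in increasing order (the analogue of sorted_gdps)
toy_sorted = sorted(toy_data.values())

# Region names ordered by their GDP, so names and values stay paired
regions_by_gdp = sorted(toy_data, key=toy_data.get)

# The regions with the smallest and largest GDP
smallest_region = min(toy_data, key=toy_data.get)
largest_region = max(toy_data, key=toy_data.get)

print(len(toy_data), smallest_region, largest_region)
```

The same pattern should work on `data` itself, since it is also a plain dictionary.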

<h3> Questions </h3>

* Find the mean GDP of the regions represented in this data set. Which country is closest to the mean?

* Find the 10th, 50th, and 90th percentiles of GDPs for the regions represented in this data set. Which country appears at each of these percentiles?

* An *outlier* is a [data point that differs significantly from other observations](https://en.wikipedia.org/wiki/Outlier). Does this data set have any outliers?

* The mean and the median differ by a factor of over 10. Why is this? Which one do you think is a better representation of a "typical" country?

*Hint on using the data*: You can open the file `gdp_data.csv` to see all the data in a readable format, sorted in alphabetical order.
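
*Tools hint*: Here is a minimal sketch of the kinds of computations involved, run on a small made-up data set rather than the real one. The dictionary `toy_data` and the helper `nearest_rank_percentile` are hypothetical names introduced for illustration. The helper uses the nearest-rank rule, which is only one of several common percentile conventions, so other tools may give slightly different values.

```python
import math
import statistics

# Hypothetical toy data in the same {region: GDP} shape as `data`
toy_data = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0, "E": 100.0}
toy_values = sorted(toy_data.values())

# Mean and median from the standard library
toy_mean = statistics.mean(toy_values)
toy_median = statistics.median(toy_values)

def nearest_rank_percentile(sorted_values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank rule."""
    rank = math.ceil(p / 100 * len(sorted_values))
    return sorted_values[rank - 1]

# The region whose value is closest to the mean
closest_to_mean = min(toy_data, key=lambda name: abs(toy_data[name] - toy_mean))

print(toy_mean, toy_median, closest_to_mean)
print(nearest_rank_percentile(toy_values, 90))
```

In this toy example the single large value pulls the mean well above the median, which is worth keeping in mind when you look at the real data.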

<h4> Submission </h4>

Use the cells below for any computations you need to do, and write your answers to each of the four questions above in a separate cell. Then export your notebook as a PDF and upload it to Gradescope under this week's Jupyter assignment.

In [72]:
# Add your code here

*Add your answers here*