# Project 2: Elvin Yuen

## Scientific Question: How does country development affect lung cancer rates?

Lung cancer is one of the most commonly diagnosed cancers in the world. It originates in the lungs and is most common in people who smoke. The HDI (Human Development Index), published by the United Nations Development Programme, is a widely used measure of human development in countries.

Of the many databases that compile global cancer statistics, GLOBOCAN (hosted by the Global Cancer Observatory, also known as GCO) is a common source used by researchers.
The GCO website (https://gco.iarc.fr/) describes itself as an "interactive web-based platform presenting global cancer statistics to inform cancer control and cancer research."

The WCRF (World Cancer Research Fund) at wcrf.org compiles this information into easy-to-read charts.

## Scientific Hypothesis: If a country is more developed, then it will have a higher rate of lung cancer.

We plot the distribution of the lung cancer data and the distribution of the HDI data, and mark the mean of each. A two-sample t-test then compares the two distributions. We set alpha to 0.05: if the p-value is less than 0.05, we reject the null hypothesis that the two means are equal. If the p-value is 0.05 or greater, we fail to reject the null hypothesis, meaning the observed difference is consistent with random sampling variation.
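The decision rule described above can be sketched as follows. This is a minimal illustration, not the notebook's actual analysis: the helper name `compare_means` is hypothetical, and the two samples are synthetic placeholder values rather than the GLOBOCAN or UNDP data.

```python
import numpy as np
from scipy.stats import ttest_ind

ALPHA = 0.05  # significance level stated in the text


def compare_means(sample_a, sample_b, alpha=ALPHA):
    """Run a two-sample t-test and apply the alpha decision rule.

    Hypothetical helper: returns (t statistic, p-value, reject_null).
    """
    t_stat, p_value = ttest_ind(sample_a, sample_b)
    return float(t_stat), float(p_value), bool(p_value < alpha)


# Synthetic placeholder samples, NOT the real cancer or HDI data.
rng = np.random.default_rng(8)
group_a = rng.normal(loc=30.0, scale=5.0, size=50)
group_b = rng.normal(loc=10.0, scale=5.0, size=50)

t_stat, p_value, reject = compare_means(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, reject null: {reject}")
```

`ttest_ind` is called with its defaults here, which assumes equal variances in the two groups; passing `equal_var=False` would run Welch's t-test instead.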

I downloaded the cancer data from https://gco.iarc.fr/, specifying lung cancer incidence rates by country, and saved it as a .csv file. I downloaded the HDI data from http://hdr.undp.org/ by selecting the HDI data and clicking "download data"; it was also saved as a .csv file.

## Loading in Packages

**matplotlib**: plotting tool

**numpy**: mathematical functions in array structure

**pandas**: data structures and numerical tables

**seaborn**: data visualization based on matplotlib

**warnings**: warning messages for non-fatal errors

**pylab**: combination of pyplot and numpy that includes many mathematical plotting and array functions

**scipy**: scientific computing library built on numpy; used here for its statistical tests (`f_oneway`, `ttest_ind`)

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import warnings
from pylab import rcParams
from scipy.stats import f_oneway
from scipy.stats import ttest_ind

In [4]:
%matplotlib inline
warnings.filterwarnings("ignore")
rcParams['figure.figsize'] = 20,10
rcParams['font.size'] = 30
sns.set()
np.random.seed(8)

In [5]:
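def plot_distribution(inp):
    """Plot the distribution of `inp` and mark its mean with a dashed line."""
    fig = plt.figure()
    # sns.distplot is deprecated since seaborn 0.11;
    # sns.histplot(inp, kde=True) is the closest replacement.
    sns.distplot(inp)
    plt.axvline(np.mean(inp), color="k", linestyle="dashed", linewidth=5)
    # Put the mean label slightly right of the line, near the top of the axes.
    _, max_ = plt.ylim()
    plt.text(
        inp.mean() + inp.mean() / 10,
        max_ - max_ / 10,
        "Mean: {:.2f}".format(inp.mean()),
    )
    return fig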



In [10]:
cancer = pd.read_csv('cancer_data.csv', encoding='latin-1')
hdi = pd.read_csv('human-development-index.csv', encoding='latin-1')







          Country   Number
0           Kenya    42116
1     South Sudan     6312
2        Zimbabwe    16083
3   United States  2281658
4  United Kingdom   457960
5          Canada   274364
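To relate the two downloads, the tables would need to be aligned by country name. The sketch below shows one way to do that with `DataFrame.merge`. The miniature tables, the column names `LungCancerRate` and `HDI`, and all values are hypothetical placeholders; the real files may use different column names and country spellings.

```python
import pandas as pd

# Hypothetical miniature tables standing in for the two .csv downloads.
cancer_demo = pd.DataFrame(
    {"Country": ["A", "B", "C"], "LungCancerRate": [10.0, 25.0, 40.0]}
)
hdi_demo = pd.DataFrame(
    {"Country": ["A", "B", "D"], "HDI": [0.5, 0.7, 0.9]}
)

# Keep only countries that appear in both tables.
merged = cancer_demo.merge(hdi_demo, on="Country", how="inner")

# List countries found in only one table, e.g. because of spelling differences.
outer = cancer_demo.merge(hdi_demo, on="Country", how="outer", indicator=True)
unmatched = outer.loc[outer["_merge"] != "both", "Country"].tolist()

print(merged)
print("Unmatched countries:", unmatched)
```

Checking the unmatched names before analysis helps catch countries that are silently dropped because the two sources label them differently.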
