# Setup

In [1]:
import pandas as pd

# Exercise 1

**We want to look at data for the Facebook, Apple, Amazon, Netflix, and Google
(FAANG) stocks, but we were given each as a separate CSV file (obtained using the
stock_analysis package we will build in Chapter 7, Financial Analysis – Bitcoin
and the Stock Market). Combine them into a single file and store the dataframe of
the FAANG data as faang for the rest of the exercises:**

* Read in the aapl.csv, amzn.csv, fb.csv, goog.csv, and nflx.csv files.
* Add a column to each dataframe, called ticker, indicating the ticker symbol it is for (Apple's is AAPL, for example); this is how you look up a stock. In this case, the filenames happen to be the ticker symbols.
* Append them together into a single dataframe.
* Save the result in a CSV file called faang.csv.

In [2]:
tickers = ['aapl','amzn', 'fb', 'goog', 'nflx']

def create_df(filename):
    return pd.read_csv(f'exercises/{filename}.csv')

def add_column_to_df(df, columnName, columnValue):
    df[columnName] = columnValue
    return df

def append_dfs(df1, df2):
    # DataFrame.append() was removed in pandas 2.0; pd.concat() replaces it
    return pd.concat([df1, df2])

def create_dfs_from_filenames(filenames):
    df = pd.DataFrame()
    for name in filenames:
        newDf = create_df(name)
        newDf = add_column_to_df(newDf, 'ticker', name.upper())
        df = append_dfs(df, newDf)
    return df

faangDf = create_dfs_from_filenames(tickers)
# index=False keeps the repeated per-file row numbers out of the saved file
faangDf.to_csv('exercises/faang.csv', index=False)
faangDf.head()

Unnamed: 0,date,high,low,open,close,volume,ticker
0,2018-01-02,43.075001,42.314999,42.540001,43.064999,102223600.0,AAPL
1,2018-01-03,43.637501,42.990002,43.1325,43.057499,118071600.0,AAPL
2,2018-01-04,43.3675,43.02,43.134998,43.2575,89738400.0,AAPL
3,2018-01-05,43.842499,43.262501,43.360001,43.75,94640000.0,AAPL
4,2018-01-08,43.9025,43.482498,43.587502,43.587502,82271200.0,AAPL
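
The read, tag, and combine steps above can also be done with a single `pd.concat()` call instead of growing a dataframe inside a loop. The sketch below is a minimal illustration, not the solution used here: the `frames` dictionary and its values are made up to stand in for the five CSV files.

```python
import pandas as pd

# Made-up stand-ins for two of the per-ticker CSV files.
frames = {
    'aapl': pd.DataFrame({'date': ['2018-01-02', '2018-01-03'], 'close': [1.0, 2.0]}),
    'amzn': pd.DataFrame({'date': ['2018-01-02', '2018-01-03'], 'close': [3.0, 4.0]}),
}

# Tag each frame with its upper-cased ticker, then combine them in one call.
# ignore_index=True gives the result a fresh 0..n-1 index.
combined = pd.concat(
    [df.assign(ticker=name.upper()) for name, df in frames.items()],
    ignore_index=True,
)
print(combined)
```

Building the list first and concatenating once avoids copying the accumulated dataframe on every iteration.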


# Exercise 2

With faang, use type conversion to cast the values of the date column into
datetimes and the volume column into integers. Then, sort by date and ticker.

In [3]:
def convert_column_type(df, columnName, type_):
    df[columnName] = df[columnName].astype(type_)
    return df

def sort_df(df, criteria):
    return df.sort_values(by=criteria)

# pandas 2.x rejects the unit-less 'datetime64' dtype, so the unit is given explicitly
faangDf = convert_column_type(faangDf, 'date', 'datetime64[ns]')
faangDf = convert_column_type(faangDf, 'volume', 'int')
faangDf = sort_df(faangDf, ['date', 'ticker'])
faangDf.head()


Unnamed: 0,date,high,low,open,close,volume,ticker
0,2018-01-02,43.075001,42.314999,42.540001,43.064999,102223600,AAPL
0,2018-01-02,1190.0,1170.51001,1172.0,1189.01001,2694500,AMZN
0,2018-01-02,181.580002,177.550003,177.679993,181.419998,18151900,FB
0,2018-01-02,1066.939941,1045.22998,1048.339966,1065.0,1237600,GOOG
0,2018-01-02,201.649994,195.419998,196.100006,201.070007,10966900,NFLX
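
In the output above, every row carries the index label 0 because each CSV file kept its own row numbers. If a unique index is wanted, one option is to reset it after sorting. The sketch below uses made-up data and `pd.to_datetime()` for the date conversion; it is one possible approach, not the one the exercise prescribes.

```python
import pandas as pd

# Made-up rows standing in for the combined FAANG data.
toy = pd.DataFrame({
    'date': ['2018-01-03', '2018-01-02', '2018-01-02'],
    'ticker': ['AAPL', 'FB', 'AAPL'],
    'volume': [3.0, 2.0, 1.0],
})

tidy = (
    toy.assign(
        date=pd.to_datetime(toy['date']),       # strings -> datetimes
        volume=toy['volume'].astype('int64'),   # floats -> integers
    )
    .sort_values(['date', 'ticker'])
    .reset_index(drop=True)                     # replace repeated labels with 0..n-1
)
print(tidy)
```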


# Exercise 3

Find the seven rows in faang with the lowest value for volume.

In [4]:
def find_smallest(df, rows, columnName):
    return df.nsmallest(rows, columnName)

faangDfSmallest = find_smallest(faangDf, 7, 'volume')
faangDfSmallest.head(10)

Unnamed: 0,date,high,low,open,close,volume,ticker
126,2018-07-03,1135.819946,1100.02002,1135.819946,1102.890015,679000,GOOG
226,2018-11-23,1037.589966,1022.398987,1030.0,1023.880005,691500,GOOG
99,2018-05-24,1080.469971,1066.150024,1079.0,1079.23999,766800,GOOG
130,2018-07-10,1159.589966,1149.589966,1156.97998,1152.839966,798400,GOOG
152,2018-08-09,1255.541992,1246.01001,1249.900024,1249.099976,848600,GOOG
159,2018-08-20,1211.0,1194.625977,1205.02002,1207.77002,870800,GOOG
161,2018-08-22,1211.839966,1199.0,1200.0,1207.329956,887400,GOOG


# Exercise 4

Right now, the data is somewhere between long and wide format. Use melt()
to make it completely long format. Hint: date and ticker are our ID variables
(they uniquely identify each row). We need to melt the rest so that we don't have
separate columns for open, high, low, close, and volume.

In [5]:
def convert_to_long(df, keys, values):
    return df.melt(
        id_vars=keys, 
        value_vars=values
        )

longFaangDf = convert_to_long(
    faangDf, 
    ['ticker', 'date'], 
    ['open', 'high', 'low', 'close', 'volume'])

longFaangDf.head()

Unnamed: 0,ticker,date,variable,value
0,AAPL,2018-01-02,open,42.540001
1,AMZN,2018-01-02,open,1172.0
2,FB,2018-01-02,open,177.679993
3,GOOG,2018-01-02,open,1048.339966
4,NFLX,2018-01-02,open,196.100006
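
To see what `melt()` does to the shape, it can help to try it on a tiny frame. The sketch below uses made-up values: two rows with two value columns become four long-format rows, and `pivot()` reverses the operation.

```python
import pandas as pd

# Made-up wide-ish data: one row per (ticker, date).
wide = pd.DataFrame({
    'ticker': ['AAPL', 'AMZN'],
    'date': pd.to_datetime(['2018-01-02', '2018-01-02']),
    'open': [1.0, 3.0],
    'close': [2.0, 4.0],
})

# Each melted column contributes one row per original row: 2 rows x 2 columns = 4 rows.
long = wide.melt(id_vars=['ticker', 'date'], value_vars=['open', 'close'])

# Pivoting on the ID variables turns the long data back into one column per variable.
restored = long.pivot(
    index=['ticker', 'date'], columns='variable', values='value'
).reset_index()
print(long)
print(restored)
```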


# Exercise 5

Suppose we found out that on July 26, 2018 there was a glitch in how the data was
recorded. How should we handle this? Note that there is no coding required for this
exercise.

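* The simplest fix is to drop the values recorded on that date and interpolate them from the surrounding trading days.
* However, the true values may have moved sharply that day, and interpolation would hide that, so it is worth doing more research (for example, checking another data source) before interpolating.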
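
Although the exercise requires no code, the drop-and-interpolate option can be sketched in a few lines. The prices below are made up, and linear interpolation is only one possible choice; the sketch shows the mechanics, not a recommendation.

```python
import numpy as np
import pandas as pd

# Made-up closing prices around the glitch date; 50.0 is the suspect value.
prices = pd.Series(
    [10.0, 11.0, 50.0, 13.0, 14.0],
    index=pd.to_datetime(
        ['2018-07-24', '2018-07-25', '2018-07-26', '2018-07-27', '2018-07-30']
    ),
)

glitch_date = pd.Timestamp('2018-07-26')

# Blank out the suspect value, then fill it from its neighbors.
cleaned = prices.copy()
cleaned.loc[glitch_date] = np.nan
filled = cleaned.interpolate(method='linear')

print(filled.loc[glitch_date])  # midpoint of the neighboring values 11.0 and 13.0
```

The filled value is always a smooth blend of its neighbors, which is exactly why a genuine large move on that date would be lost.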

# Exercise 6

**The European Centre for Disease Prevention and Control (ECDC) provides
an open dataset on COVID-19 cases called daily number of new reported cases
of COVID-19 by country worldwide (https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide). This dataset is updated
daily, but we will use a snapshot that contains data from January 1, 2020 through
September 18, 2020. Clean and pivot the data so that it is in wide format:**

* Read in the covid19_cases.csv file.
* Create a date column using the data in the dateRep column and the pd.to_datetime() function.
* Set the date column as the index and sort the index.
* Replace all occurrences of United_States_of_America and United_Kingdom with USA and UK, respectively. Hint: the replace() method can be run on the dataframe as a whole.
* Using the countriesAndTerritories column, filter the cleaned COVID-19 cases data down to Argentina, Brazil, China, Colombia, India, Italy, Mexico, Peru, Russia, Spain, Turkey, the UK, and the USA.
* Pivot the data so that the index contains the dates, the columns contain the country names, and the values are the case counts (the cases column). Be sure to fill in NaN values with 0.

In [6]:
countries = [
    'Argentina', 'Brazil', 'China', 'Colombia', 'India', 'Italy', 
    'Mexico', 'Peru', 'Russia', 'Spain', 'Turkey', 'UK', 'USA'
    ]

covidDf = pd.read_csv('exercises/covid19_cases.csv')

covidDf = covidDf\
    .assign(
        date=lambda x: pd.to_datetime(
            x.dateRep, 
            format='%d/%m/%Y'
            )
        )

covidDf = covidDf\
    .set_index('date')\
    .sort_index()

covidDf = covidDf\
    .replace('United_States_of_America', 'USA')\
    .replace('United_Kingdom', 'UK')

covidDf = covidDf[
    covidDf.countriesAndTerritories.isin(countries)
    ]\
    .reset_index()

covidDf = covidDf\
    .pivot(
        index='date', 
        columns='countriesAndTerritories', 
        values='cases')\
    .fillna(0)

covidDf.head()


date,Argentina,Brazil,China,Colombia,India,Italy,Mexico,Peru,Russia,Spain,Turkey,UK,USA
2020-01-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-03,0.0,0.0,17.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-05,0.0,0.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
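
The replace, pivot, and fill steps are easier to follow on a frame small enough to check by hand. The sketch below uses made-up case counts and only the column names from the exercise.

```python
import pandas as pd

# Made-up long-format rows; Italy has no row for 2020-01-01.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-02']),
    'countriesAndTerritories': ['United_Kingdom', 'United_Kingdom', 'Italy'],
    'cases': [1, 2, 3],
})

wide = (
    toy.replace('United_Kingdom', 'UK')   # runs on the dataframe as a whole
    .pivot(index='date', columns='countriesAndTerritories', values='cases')
    .fillna(0)                            # the missing Italy entry becomes 0
)
print(wide)
```

The missing date/country combination first appears as NaN after the pivot, which is why the counts are displayed as floats even after filling.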


# Exercise 7

**In order to determine the case totals per country efficiently, we need the aggregation
skills we will learn about in Chapter 4, Aggregating Pandas DataFrames, so the
ECDC data in the covid19_cases.csv file has been aggregated for us and saved
in the covid19_total_cases.csv file. It contains the total number of cases
per country.**

Use this data to find the 20 countries with the largest COVID-19 case
totals. Hints: when reading in the CSV file, pass in index_col='index', and
note that it will be helpful to transpose the data before isolating the countries.


In [13]:
casesDf = pd\
    .read_csv(
        'exercises/covid19_total_cases.csv',
        index_col='index')\
    .T

top20casesDf = casesDf\
    .nlargest(20, 'cases')\
    .sort_values(by='cases', ascending=False)

top20casesDf.head(30)


index,cases
USA,6724667
India,5308014
Brazil,4495183
Russia,1091186
Peru,756412
Colombia,750471
Mexico,688954
South_Africa,657627
Spain,640040
Argentina,601700
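
The transpose is needed because the totals arrive as a single row with one column per country, while `nlargest()` ranks rows. The sketch below uses made-up country labels and totals. It also shows that `nlargest()` already returns its rows in descending order, so a following `sort_values()` does not change the result.

```python
import pandas as pd

# Made-up totals laid out like the file: one 'cases' row, one column per country.
totals = pd.DataFrame({'A': [5], 'B': [9], 'C': [1]}, index=['cases'])

# After transposing, countries are rows and 'cases' is a column that can be ranked.
by_country = totals.T
top2 = by_country.nlargest(2, 'cases')
print(top2)
```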

