# Aggregating Data with Pandas

**8.1.1 Intended Learning Outcomes**

After this activity, the student should be able to:
- Demonstrate querying and merging of dataframes
- Perform advanced calculations on dataframes
- Aggregate dataframes with pandas and numpy
- Work with time series data

**8.1.2 Resources**
- Computing Environment using Python 3.x
- Attached Datasets (under Instructional Materials)

**8.1.3 Procedures**

The procedures can be found in the Canvas module. See the following topics:
- 8.1 Weather Data Collection - [github link](https://github.com/zephyrowwa/M8-Aggregating-Pandas-DataFrames/tree/main/8.1)
- 8.2 Querying and Merging - [github link](https://github.com/zephyrowwa/M8-Aggregating-Pandas-DataFrames/tree/main/8.2)
- 8.3 Dataframe Operations - [github link](https://github.com/zephyrowwa/M8-Aggregating-Pandas-DataFrames/tree/main/8.3)
- 8.4 Aggregations - [github link](https://github.com/zephyrowwa/M8-Aggregating-Pandas-DataFrames/tree/main/8.4)
- 8.5 Time Series - [github link](https://github.com/zephyrowwa/M8-Aggregating-Pandas-DataFrames/tree/main/8.5)


**8.1.4 Data Analysis**
- Provide some comments here about the results of the procedures.


**8.1.5 Supplementary Activity**

Using the CSV files provided and what we have learned so far in this module, complete the following exercises:

In [91]:
import pandas as p
import numpy as n
eq = p.read_csv('/content/earthquakes.csv')
faang = p.read_csv('/content/faang.csv')

# 1. With the earthquakes.csv file, select all the earthquakes in Japan with a magType of mb and a magnitude of 4.9 or greater

In [92]:
eq

,mag,magType,time,place,tsunami,parsed_place
0,1.35,ml,1539475168010,"9km NE of Aguanga, CA",0,California
1,1.29,ml,1539475129610,"9km NE of Aguanga, CA",0,California
2,3.42,ml,1539475062610,"8km NE of Aguanga, CA",0,California
3,0.44,ml,1539474978070,"9km NE of Aguanga, CA",0,California
4,2.16,md,1539474716050,"10km NW of Avenal, CA",0,California
...,...,...,...,...,...,...
9327,0.62,md,1537230228060,"9km ENE of Mammoth Lakes, CA",0,California
9328,1.00,ml,1537230135130,"3km W of Julian, CA",0,California
9329,2.40,md,1537229908180,"35km NNE of Hatillo, Puerto Rico",0,Puerto Rico
9330,1.10,ml,1537229545350,"9km NE of Aguanga, CA",0,California


In [93]:
jpnmb49 = eq.query('magType == "mb" and mag >= 4.9 and parsed_place == "Japan"')
jpnmb49.sort_values(by='mag')

,mag,magType,time,place,tsunami,parsed_place
1563,4.9,mb,1538977532250,"293km ESE of Iwo Jima, Japan",0,Japan
3072,4.9,mb,1538579732490,"15km ENE of Hasaki, Japan",0,Japan
3632,4.9,mb,1538450871260,"53km ESE of Hitachi, Japan",0,Japan
2576,5.4,mb,1538697528010,"37km E of Tomakomai, Japan",0,Japan
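
The same selection can also be written as a boolean mask, which some readers find easier to debug than a `query()` string. The sketch below is only an illustration: `sample` is a small made-up frame that reuses the column names of `earthquakes.csv`, not the real data.

```python
import pandas as pd

# Made-up rows that reuse the earthquakes.csv column names (assumption)
sample = pd.DataFrame({
    'mag': [4.9, 5.4, 4.8, 5.0, 5.1],
    'magType': ['mb', 'mb', 'mb', 'ml', 'mb'],
    'parsed_place': ['Japan', 'Japan', 'Japan', 'Japan', 'Alaska'],
})

# Combine the three conditions with & (each one wrapped in parentheses)
mask = (
    (sample['magType'] == 'mb')
    & (sample['mag'] >= 4.9)
    & (sample['parsed_place'] == 'Japan')
)
by_mask = sample[mask]

# Equivalent query() call, as used in the cell above
by_query = sample.query('magType == "mb" and mag >= 4.9 and parsed_place == "Japan"')
```

Both forms should select the same rows; the mask version makes it possible to inspect each condition separately.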


# 2. Create bins for each full number of magnitude (for example, the first bin is 0-1, the second is 1-2, and so on) with a magType of ml and count how many are in each bin

In [94]:
# filter to only the ml magType
binml = eq[eq['magType'] == 'ml']
# whole-number bin edges: 0, 1, ..., int(max magnitude)
bins = range(int(eq['mag'].max()) + 1)
# assign each magnitude to a bin
# NOTE: this bins eq['mag'] (every magType), so the counts below are not
# restricted to ml; pass binml['mag'] instead to count only ml earthquakes
counts, bins = p.cut(eq['mag'], bins=bins, retbins=True)
magcounts = counts.value_counts().sort_index()

eqmlbin = p.DataFrame({'Magnitude (ml)':bins[:-1], 'occurrences': magcounts})
eqmlbin

,Magnitude (ml),occurrences
"(0, 1]",0,2941
"(1, 2]",1,3802
"(2, 3]",2,1157
"(3, 4]",3,233
"(4, 5]",4,534
"(5, 6]",5,117
"(6, 7]",6,7
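
The cell above builds `binml` but then passes `eq['mag']` to `cut()`, so the filter does not reach the counts. A minimal sketch of filtering first and binning second is given here, assuming a made-up `sample` frame with the same column names. Bins are right-closed by default, so a magnitude of exactly 0 (or any negative magnitude) would fall outside the first bin unless the edges are adjusted.

```python
import numpy as np
import pandas as pd

# Made-up magnitudes; only the column names come from the exercise
sample = pd.DataFrame({
    'mag': [0.5, 1.2, 1.8, 2.4, 4.9, 5.4, 0.9],
    'magType': ['ml', 'ml', 'ml', 'ml', 'mb', 'mb', 'md'],
})

# 1. keep only the ml magnitudes
ml_only = sample.loc[sample['magType'] == 'ml', 'mag']

# 2. whole-number edges up to ceil(max) so the largest value lands in a bin
edges = np.arange(0, int(np.ceil(ml_only.max())) + 1)

# 3. bin the filtered values and count per bin, in bin order
ml_counts = pd.cut(ml_only, bins=edges).value_counts().sort_index()
```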


# 3. Using the faang.csv file, group by the ticker and resample to monthly frequency.
Make the following aggregations:
- Mean of the opening price
- Maximum of the high price
- Minimum of the low price
- Mean of the closing price
- Sum of the volume traded


In [95]:
faang.dtypes

ticker     object
date       object
open      float64
high      float64
low       float64
close     float64
volume      int64
dtype: object

In [96]:
faang['date'] = p.to_datetime(faang['date'])  # convert the date strings to datetime
faang.set_index('date', inplace=True)  # use date as the index, modifying faang in place

In [97]:
faang.dtypes

ticker     object
open      float64
high      float64
low       float64
close     float64
volume      int64
dtype: object

In [98]:
faang

date,ticker,open,high,low,close,volume
2018-01-02,FB,177.68,181.58,177.5500,181.42,18151903
2018-01-03,FB,181.88,184.78,181.3300,184.67,16886563
2018-01-04,FB,184.90,186.21,184.0996,184.33,13880896
2018-01-05,FB,185.59,186.90,184.9300,186.85,13574535
2018-01-08,FB,187.20,188.90,186.3300,188.28,17994726
...,...,...,...,...,...,...
2018-12-24,GOOG,973.90,1003.54,970.1100,976.22,1590328
2018-12-26,GOOG,989.01,1040.00,983.0000,1039.46,2373270
2018-12-27,GOOG,1017.15,1043.89,997.0000,1043.88,2109777
2018-12-28,GOOG,1049.62,1055.56,1033.1000,1037.08,1413772


In [99]:
# group by ticker, then resample each group to monthly frequency
HHfaangtikd = faang.groupby('ticker').resample('M')

agfng = HHfaangtikd.agg({'open':'mean','high':'max','low':'min','close':'mean','volume':'sum'})
agfng

ticker,date,open,high,low,close,volume
AAPL,2018-01-31,170.71469,176.6782,161.5708,170.699271,659679440
AAPL,2018-02-28,164.562753,177.9059,147.9865,164.921884,927894473
AAPL,2018-03-31,172.421381,180.7477,162.466,171.878919,713727447
AAPL,2018-04-30,167.332895,176.2526,158.2207,167.286924,666360147
AAPL,2018-05-31,182.635582,187.9311,162.7911,183.207418,620976206
AAPL,2018-06-30,186.605843,192.0247,178.7056,186.508652,527624365
AAPL,2018-07-31,188.065786,193.765,181.3655,188.179724,393843881
AAPL,2018-08-31,210.460287,227.1001,195.0999,211.477743,700318837
AAPL,2018-09-30,220.611742,227.8939,213.6351,220.356353,678972040
AAPL,2018-10-31,219.489426,231.6645,204.4963,219.137822,789748068
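
A note on the frequency alias: pandas 2.2 and later deprecate `'M'` in favour of `'ME'` for month-end, so the cell above may emit a warning on newer installations. The sketch below sidesteps the question by using `'MS'` (month start), which changes only the date label of each group, not the rows that fall into it. The tickers and prices are made up for illustration.

```python
import pandas as pd

# Made-up prices for two hypothetical tickers over two months
dates = pd.to_datetime(['2018-01-02', '2018-01-03', '2018-02-01', '2018-02-02'] * 2)
prices = pd.DataFrame({
    'ticker': ['AAA'] * 4 + ['BBB'] * 4,
    'open': [10.0, 12.0, 20.0, 22.0, 1.0, 3.0, 5.0, 7.0],
    'high': [11.0, 13.0, 21.0, 25.0, 2.0, 4.0, 6.0, 8.0],
    'volume': [100, 200, 300, 400, 10, 20, 30, 40],
}, index=pd.DatetimeIndex(dates, name='date'))

# One row per (ticker, month); each column gets its own aggregation
monthly = prices.groupby('ticker').resample('MS').agg(
    {'open': 'mean', 'high': 'max', 'volume': 'sum'}
)
```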


# 4. Build a crosstab with the earthquake data between the tsunami column and the magType column.
Rather than showing the frequency count, show the maximum magnitude that was observed for each combination. Put the magType along the columns.


In [100]:
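# NOTE: crosstab without `values`/`aggfunc` counts occurrences, so .max() here
# returns the largest count per magType, not the maximum magnitude
eqct = p.DataFrame(p.crosstab(eq['tsunami'],eq['magType']).max()).T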
eqct

magType,mb,mb_lg,md,mh,ml,ms_20,mw,mwb,mwr,mww
0,574,30,1796,12,6798,1,2,2,14,42
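
The output above holds counts, whereas the exercise asks for the maximum magnitude per combination. `crosstab()` accepts `values` and `aggfunc` for this purpose. The following is a minimal sketch on made-up rows (only the column names are taken from the exercise); combinations that never occur come back as `NaN`.

```python
import pandas as pd

# Made-up rows that reuse the tsunami / magType / mag column names
sample = pd.DataFrame({
    'tsunami': [0, 0, 1, 1, 0],
    'magType': ['mb', 'mb', 'mww', 'mww', 'ml'],
    'mag': [4.9, 5.4, 6.1, 7.5, 2.0],
})

# tsunami along the rows, magType along the columns,
# each cell holding the largest magnitude seen for that pair
max_mag = pd.crosstab(
    index=sample['tsunami'],
    columns=sample['magType'],
    values=sample['mag'],
    aggfunc='max',
)
```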


# 5. Calculate the rolling 60-day aggregations of OHLC data by ticker for the FAANG data. Use the same aggregations as exercise no. 3

In [101]:
# NOTE: Grouper(freq='2M') groups rows into fixed two-month bins;
# it does not compute a rolling 60-day window
r60fng = faang.groupby([p.Grouper(freq='2M'),'ticker'])

r60fng.agg({'open':'mean','high':'max','low':'min','close':'mean','volume':'sum'})

date,ticker,open,high,low,close,volume
2018-01-31,AAPL,170.71469,176.6782,161.5708,170.699271,659679440
2018-01-31,AMZN,1301.377143,1472.58,1170.51,1309.010952,96371290
2018-01-31,FB,184.364762,190.66,175.8,184.962857,495655736
2018-01-31,GOOG,1127.200952,1186.89,1045.23,1130.770476,28738485
2018-01-31,NFLX,231.269286,286.81,195.42,232.908095,238377533
2018-03-31,AAPL,168.688533,180.7477,147.9865,168.574328,1641621920
2018-03-31,AMZN,1497.01275,1617.54,1265.93,1493.8155,268184171
2018-03-31,FB,176.90375,195.32,149.02,176.71,1512854463
2018-03-31,GOOG,1092.55575,1177.05,980.64,1089.93075,87814154
2018-03-31,NFLX,292.839,333.98,236.11,292.8555,448035310
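
For comparison, a rolling window produces one row per original date, each summarising the 60 calendar days ending on that date. The sketch below shows one way to express this with `groupby().rolling('60D')`. It assumes a datetime index that is sorted within each ticker, and it uses made-up tickers and prices.

```python
import pandas as pd

# Four made-up observations per ticker, 25 days apart:
# 2018-01-01, 2018-01-26, 2018-02-20, 2018-03-17
dates = pd.date_range('2018-01-01', periods=4, freq='25D')
prices = pd.DataFrame({
    'ticker': ['AAA'] * 4 + ['BBB'] * 4,
    'open': [10.0, 20.0, 30.0, 40.0, 1.0, 2.0, 3.0, 4.0],
    'high': [11.0, 25.0, 31.0, 41.0, 2.0, 3.0, 4.0, 5.0],
    'volume': [100, 200, 300, 400, 10, 20, 30, 40],
}, index=pd.DatetimeIndex(list(dates) * 2, name='date'))

# For every row, aggregate that ticker's rows from the preceding 60 days
rolling60 = prices.groupby('ticker').rolling('60D').agg(
    {'open': 'mean', 'high': 'max', 'volume': 'sum'}
)
```

On 2018-03-17 the window for `AAA` no longer reaches back to 2018-01-01, so only the last three observations contribute to that row.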


# 6. Create a pivot table of the FAANG data that compares the stocks. Put the ticker in the rows and show the averages of the OHLC and volume traded data.


In [102]:
import scipy.stats as sts

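# average the OHLC and volume columns
# (faang.columns[:-1] would keep ticker and drop volume)
cols = ['open', 'high', 'low', 'close', 'volume']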

faangpv = p.pivot_table(faang,index='ticker',values=cols, aggfunc='mean')
faangpv

ticker,close,high,low,open,volume
AAPL,186.986218,188.906858,185.135729,187.038674,34021450.0
AMZN,1641.726175,1662.839801,1619.840398,1644.072669,5649563.0
FB,171.510936,173.615298,169.30311,171.454424,27687980.0
GOOG,1113.225139,1125.777649,1101.001594,1113.554104,1742645.0
NFLX,319.290299,325.224583,313.187273,319.620533,11470300.0


# 7. Calculate the Z-scores for each numeric column of Netflix's data (ticker is NFLX) using apply().


In [103]:
# NFLX rows of the monthly aggregates from exercise 3 (not the daily data)
fngagg_nlfx = agfng.query('ticker == "NFLX"')
fngagg_nlfx

ticker,date,open,high,low,close,volume
NFLX,2018-01-31,231.269286,286.81,195.42,232.908095,238377533
NFLX,2018-02-28,270.873158,297.36,236.11,271.443684,184585819
NFLX,2018-03-31,312.712857,333.98,275.9,312.228095,263449491
NFLX,2018-04-30,309.129529,338.82,271.2239,307.46619,262064417
NFLX,2018-05-31,329.779759,356.1,305.73,331.536818,142051114
NFLX,2018-06-30,384.557595,423.2056,352.82,384.133333,244032001
NFLX,2018-07-31,380.96909,419.77,328.0,381.515238,305487432
NFLX,2018-08-31,345.409591,376.8085,310.928,346.257826,213144082
NFLX,2018-09-30,363.326842,383.2,335.83,362.641579,170832156
NFLX,2018-10-31,340.025348,386.7999,271.2093,335.445652,363589920


In [104]:
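# Earlier attempt with scipy.stats.zscore; it fails because the lambda passes
# `cols` (the column labels) to zscore instead of `column`:
# ZSFN = fngagg_nlfx.apply(lambda column: sts.zscore(cols), axis = 0)

def calculate_zscores(data):
    # subtract the column mean and divide by the sample standard deviation
    return (data - data.mean()) / data.std()
ZSNF = fngagg_nlfx.apply(calculate_zscores)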
ZSNF

ticker,date,open,high,low,close,volume
NFLX,2018-01-31,-1.831799,-1.429505,-1.786494,-1.802025,-0.025874
NFLX,2018-02-28,-1.00252,-1.200973,-0.930753,-0.990095,-0.927949
NFLX,2018-03-31,-0.126424,-0.407718,-0.093939,-0.130783,0.394577
NFLX,2018-04-30,-0.201457,-0.302875,-0.192281,-0.231115,0.37135
NFLX,2018-05-31,0.230946,0.071441,0.533408,0.276044,-1.641247
NFLX,2018-06-30,1.377957,1.525068,1.523746,1.384232,0.06895
NFLX,2018-07-31,1.302816,1.450647,1.001762,1.32907,1.099544
NFLX,2018-08-31,0.558224,0.520024,0.642726,0.58621,-0.449033
NFLX,2018-09-30,0.933399,0.658475,1.166433,0.931409,-1.158595
NFLX,2018-10-31,0.445481,0.736456,-0.192588,0.358402,2.073911
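
Two details can affect these numbers. First, the cell above standardises the monthly aggregates; to standardise the daily Netflix rows, the same function would be applied to the NFLX rows of `faang` with the `ticker` column dropped. Second, pandas `std()` divides by n - 1 by default, while `scipy.stats.zscore` divides by n, so the two give slightly different Z-scores. The sketch below illustrates the second point on made-up numbers; `zscore` and its `ddof` parameter are hypothetical names chosen for this example.

```python
import pandas as pd

# Made-up numeric columns standing in for one ticker's data
nflx_like = pd.DataFrame({
    'close': [1.0, 2.0, 3.0, 4.0, 5.0],
    'volume': [10, 20, 30, 40, 50],
})

def zscore(col, ddof=1):
    """Standardise one column; ddof selects sample (1) or population (0) std."""
    return (col - col.mean()) / col.std(ddof=ddof)

# apply() calls zscore once per column; extra keywords are forwarded to it
z_sample = nflx_like.apply(zscore)          # pandas default, divides by n - 1
z_pop = nflx_like.apply(zscore, ddof=0)     # divides by n, like scipy's default
```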


# 8. Add event descriptions:
- Create a dataframe with the following three columns: ticker, date, and event. The columns should have the following values:
 - ticker: 'FB'
 - date: ['2018-07-25', '2018-03-19', '2018-03-20']
 - event: ['Disappointing user growth announced after close.', 'Cambridge Analytica story', 'FTC investigation']
- Set the index to ['date', 'ticker']
- Merge this data with the FAANG data using an outer join

In [105]:
df = {'ticker': 'FB',
      'date':['2018-07-25', '2018-03-19', '2018-03-20'],
      'event':['Disappointing user growth announced after close.', 'Cambridge Analytica story', 'FTC investigation']}

put = p.DataFrame(df)
put

,ticker,date,event
0,FB,2018-07-25,Disappointing user growth announced after close.
1,FB,2018-03-19,Cambridge Analytica story
2,FB,2018-03-20,FTC investigation


In [106]:
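put['date'] = p.to_datetime(put['date'])
# NOTE: set_index returns a new DataFrame; the result is not assigned here,
# so put keeps its default integer index
put.set_index(['ticker','date'])

# merge keys: every column except 'event' (ticker and date)
oncol = list(put.columns[:-1])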

In [107]:
foj = p.merge(put,faang,on = oncol, how = 'outer')
foj

,ticker,date,event,open,high,low,close,volume
0,FB,2018-07-25,Disappointing user growth announced after close.,215.715,218.62,214.27,217.50,64592585
1,FB,2018-03-19,Cambridge Analytica story,177.010,177.17,170.06,172.56,88140060
2,FB,2018-03-20,FTC investigation,167.470,170.20,161.95,168.15,129851768
3,FB,2018-01-02,,177.680,181.58,177.55,181.42,18151903
4,FB,2018-01-03,,181.880,184.78,181.33,184.67,16886563
...,...,...,...,...,...,...,...,...
1250,GOOG,2018-12-24,,973.900,1003.54,970.11,976.22,1590328
1251,GOOG,2018-12-26,,989.010,1040.00,983.00,1039.46,2373270
1252,GOOG,2018-12-27,,1017.150,1043.89,997.00,1043.88,2109777
1253,GOOG,2018-12-28,,1049.620,1055.56,1033.10,1037.08,1413772
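
The exercise asks for the index to be set to `['date', 'ticker']` before merging. One possible way to follow that wording is to index both frames the same way and use an outer `join()`, which aligns on the shared index levels. The frames below are small made-up stand-ins: `events` and `prices` are hypothetical names and the `close` values are not real prices.

```python
import pandas as pd

# Event descriptions, indexed by (date, ticker)
events = pd.DataFrame({
    'ticker': 'FB',
    'date': pd.to_datetime(['2018-07-25', '2018-03-19']),
    'event': ['Disappointing user growth announced after close.',
              'Cambridge Analytica story'],
}).set_index(['date', 'ticker'])

# Made-up price rows, indexed the same way
prices = pd.DataFrame({
    'ticker': ['FB', 'FB', 'NFLX'],
    'date': pd.to_datetime(['2018-03-19', '2018-03-21', '2018-03-19']),
    'close': [1.0, 2.0, 3.0],
}).set_index(['date', 'ticker'])

# Outer join keeps every (date, ticker) pair found in either frame
merged = prices.join(events, how='outer')
```

Rows without an event get `NaN` in `event`, and an event date with no price row gets `NaN` in `close`.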


# 9. Use the transform() method on the FAANG data to represent all the values in terms of the first date in the data.
To do so, divide all the values for each ticker by the values for the first date in the data for that ticker. This is referred to as an index, and the data for the first date is the base (https://ec.europa.eu/eurostat/statistics-explained/index.php/Beginners:Statistical_concept_-_Index_and_base_year). When data is in this format, we can easily see growth over time. Hint: transform() can take a function name.


In [108]:
ftr = faang.groupby('ticker').transform(lambda x: x / x.iloc[0])
ftr

date,open,high,low,close,volume
2018-01-02,1.000000,1.000000,1.000000,1.000000,1.000000
2018-01-03,1.023638,1.017623,1.021290,1.017914,0.930292
2018-01-04,1.040635,1.025498,1.036889,1.016040,0.764707
2018-01-05,1.044518,1.029298,1.041566,1.029931,0.747830
2018-01-08,1.053579,1.040313,1.049451,1.037813,0.991341
...,...,...,...,...,...
2018-12-24,0.928993,0.940578,0.928131,0.916638,1.285047
2018-12-26,0.943406,0.974750,0.940463,0.976019,1.917695
2018-12-27,0.970248,0.978396,0.953857,0.980169,1.704782
2018-12-28,1.001221,0.989334,0.988395,0.973784,1.142383
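
Following the hint, the lambda can be replaced by a named function passed to `transform()`. The sketch below assumes the rows are already in date order within each ticker, since `iloc[0]` picks whatever row comes first. `index_to_base` is a hypothetical helper name and the data is made up.

```python
import pandas as pd

# Made-up prices for two hypothetical tickers, already sorted by date
prices = pd.DataFrame({
    'ticker': ['AAA', 'BBB', 'AAA', 'BBB'],
    'close': [10.0, 50.0, 12.0, 45.0],
    'volume': [100, 400, 150, 200],
}, index=pd.DatetimeIndex(
    ['2018-01-02', '2018-01-02', '2018-01-03', '2018-01-03'], name='date'))

def index_to_base(col):
    """Divide a column by its first value, so the first row becomes 1.0."""
    return col / col.iloc[0]

# transform() applies the function per ticker and per column,
# and returns a frame with the same shape and row order as the input
indexed = prices.groupby('ticker').transform(index_to_base)
```

A value of 1.2 then reads as 20% above the base date, and 0.9 as 10% below it.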
