# Data Science at Scale

## What do we mean by "scale"?

* Scale is determined by
    * Size of data
    * Capacity of hardware

## Big Data is

* data you can't open in Excel
* data you can't fit in RAM
* data you can't fit on a single machine
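
One rough way to tell which of these regimes a dataset falls into is to measure its in-memory footprint. A minimal sketch, assuming pandas is installed; the tiny DataFrame is made-up illustration data, not the course's heroes file:

```python
import pandas as pd

# Made-up illustration data (hypothetical names and heights).
df = pd.DataFrame({
    "name": ["Hero A", "Hero B", "Hero C"],
    "height": [180.0, 190.0, 170.0],
})

# deep=True also counts the Python string objects, which the
# default shallow estimate leaves out.
bytes_used = int(df.memory_usage(deep=True).sum())
print(f"{bytes_used} bytes in memory for {len(df)} rows")
```

Comparing this number (scaled up to the full row count) with the available RAM gives a first estimate of whether the data fits.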

## A data scientist operates on many scales

* Can't open in Excel $\rightarrow$ use `Pandas` and chunking
* Can't fit in RAM $\rightarrow$ use a database or stream the file
* Can't fit on a single machine $\rightarrow$ use Hadoop and `PySpark`
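
As an illustration of the chunking idea, the sketch below computes a mean without ever holding the whole file in memory. It assumes pandas is installed; the CSV text is a made-up stand-in for a large file, and a real file path could be passed to `read_csv` in the same way:

```python
import io
import pandas as pd

# Made-up stand-in for a large CSV file (heights 150..159).
csv_text = "Publisher,Height\n" + "\n".join(
    f"Publisher {i % 2},{150 + i}" for i in range(10)
)

total = 0.0
count = 0
# chunksize makes read_csv return an iterator of DataFrames
# instead of loading every row at once.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["Height"].sum()
    count += chunk["Height"].count()

chunked_mean = total / count
print(chunked_mean)
```

Only the running total and count are kept between chunks, so peak memory depends on `chunksize` rather than on the size of the file.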

## Example - Average Super Hero Height - `pandas`

In [1]:
import pandas as pd
from dfply import *  # provides X, filter_by, group_by, summarise, mean

heroes = pd.read_csv('./data/heroes_information.csv')
major_publisher = ['Marvel Comics', 'DC Comics']

# Keep the two major publishers, then average Height within each publisher
(heroes >>
   filter_by(X.Publisher.isin(major_publisher)) >>
   group_by(X.Publisher) >>
   summarise(mean_height = mean(X.Height)))

|   | Publisher | mean_height |
|---|---|---|
| 0 | DC Comics | 91.072093 |
| 1 | Marvel Comics | 142.756443 |
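
For comparison, the same filter, group, and aggregate steps can be written with plain pandas methods instead of `dfply` verbs. This is a sketch on made-up data: the hero names, the publisher `Other Press`, and all heights are hypothetical and not taken from `heroes_information.csv`.

```python
import pandas as pd

# Hypothetical toy data, small enough to check by hand.
heroes_toy = pd.DataFrame({
    "name": ["Hero A", "Hero B", "Hero C", "Hero D", "Hero E"],
    "Publisher": ["Marvel Comics", "Marvel Comics", "DC Comics",
                  "DC Comics", "Other Press"],
    "Height": [180.0, 190.0, 170.0, 180.0, 175.0],
})
major_publisher = ["Marvel Comics", "DC Comics"]

mean_height = (
    heroes_toy[heroes_toy["Publisher"].isin(major_publisher)]  # filter
    .groupby("Publisher", as_index=False)                      # group
    .agg(mean_height=("Height", "mean"))                       # aggregate
)
print(mean_height)
```

Each method call lines up with one `dfply` verb, which makes the two styles easy to compare side by side.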


## Example - Average Super Hero Height - `sqlalchemy`

In [2]:
from sqlalchemy.orm import sessionmaker
from sqlalchemy import create_engine, func
from heroes import Base, Hero  # local module with the ORM classes

# Connect to the SQLite file and make sure the tables exist
engine = create_engine('sqlite:///databases/heroes.db')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()

# major_publisher is reused from the pandas example
(session.query(Hero.publisher, func.avg(Hero.height).label('avg_ht'))
   .filter(Hero.publisher.in_(major_publisher))
   .group_by(Hero.publisher)
   .all())

[('DC Comics', 180.90068493150685), ('Marvel Comics', 190.5108024691358)]
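
The ORM query above corresponds roughly to a single SQL `SELECT` that combines `WHERE ... IN`, `GROUP BY`, and `AVG`. The sketch below runs that SQL directly with the standard-library `sqlite3` module against an in-memory database. The table name `heroes`, its columns, and the rows are assumptions for illustration, since the `heroes` module and `heroes.db` are not shown here; the `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE heroes (name TEXT, publisher TEXT, height REAL)")
conn.executemany(
    "INSERT INTO heroes VALUES (?, ?, ?)",
    [
        ("Hero A", "Marvel Comics", 180.0),
        ("Hero B", "Marvel Comics", 190.0),
        ("Hero C", "DC Comics", 170.0),
        ("Hero D", "DC Comics", None),  # NULL heights are skipped by AVG
        ("Hero E", "Other Press", 175.0),
    ],
)

major_publisher = ["Marvel Comics", "DC Comics"]
# One "?" placeholder per publisher keeps the values parameterized.
placeholders = ", ".join("?" for _ in major_publisher)
rows = conn.execute(
    "SELECT publisher, AVG(height) AS avg_ht FROM heroes "
    f"WHERE publisher IN ({placeholders}) "
    "GROUP BY publisher ORDER BY publisher",
    major_publisher,
).fetchall()
conn.close()
print(rows)
```

Because the database engine does the filtering and averaging, only the two result rows are returned to Python.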

## Example - Average Super Hero Height - `pyspark`

In [ ]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, mean

spark1 = SparkSession.builder.appName('Ops').getOrCreate()
# inferSchema=True asks Spark to detect column types from the data
df = spark1.read.csv('data/heroes_information.csv', inferSchema=True, header=True)

# major_publisher is reused from the pandas example
(df.where(col('Publisher').isin(major_publisher))
   .groupBy("Publisher")
   .agg(mean('Height'))
   .show())

## <font color="red"> Exercise 1: Compare and Contrast </font>

<img src="img/all_three_1.png" width=600>

Your thoughts here

## Filter using in/isin

<img src="img/all_three_2.png" width=600>

## Group by publisher

<img src="img/all_three_3.png" width=600>

## Aggregate the mean height

<img src="img/all_three_4.png" width=500>
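
The three steps highlighted above (filter, group, aggregate) do not depend on any one library. As a sketch of the shared pattern, here they are in plain Python using only the standard library; the records are made-up illustration data:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records, not taken from the heroes dataset.
records = [
    {"publisher": "Marvel Comics", "height": 180.0},
    {"publisher": "Marvel Comics", "height": 190.0},
    {"publisher": "DC Comics", "height": 170.0},
    {"publisher": "DC Comics", "height": 180.0},
    {"publisher": "Other Press", "height": 175.0},
]
major_publisher = {"Marvel Comics", "DC Comics"}

# 1. Filter: keep only rows from the major publishers.
kept = [r for r in records if r["publisher"] in major_publisher]

# 2. Group: collect heights under each publisher.
groups = defaultdict(list)
for r in kept:
    groups[r["publisher"]].append(r["height"])

# 3. Aggregate: reduce each group to its mean height.
avg_height = {pub: mean(heights) for pub, heights in groups.items()}
print(avg_height)
```

`pandas`, `sqlalchemy`, and `pyspark` each express these same three steps; what changes is where the data lives and where the work is executed.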

## Course outline

* Part 1 - Working with Tabular Data

* Part 2 - Working with Unstructured Data


## Part 1 - Working with Tabular Data

* Cleaning and prepping data in `Pandas` (2-3 weeks)
* SQLAlchemy (2 weeks)
* Spark SQL (3 weeks)

## Part 2 - Working with Unstructured Data

* Introduction to functional list processing (3 weeks)
* Processing Unstructured Data with Spark
* Project