# Lab 2 - Automatic Creation of `sqlalchemy` data type `dict`

So far, we have been manually constructing the `sqlalchemy` type `dict`, but this approach quickly becomes unwieldy.  Let's look at using the `pandas` types to programmatically construct the type `dict`.

In [2]:
import pandas as pd
from dfply import *

In [None]:
!rm -f databases/baseball.db

## Case Study - People

Let's use the `People.csv` file from [Lahman’s Baseball Database](http://www.seanlahman.com/baseball-archive/statistics/) as our motivating example, since

1. It has examples of a number of types.
2. It has many columns, so constructing the type `dict` manually would be tedious.

In [4]:
people = pd.read_csv('~/Desktop/baseball/core/People.csv')
people.head()

Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,...,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
0,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,...,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
1,aaronha01,1934.0,2.0,5.0,USA,AL,Mobile,,,,...,Aaron,Henry Louis,180.0,72.0,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
2,aaronto01,1939.0,8.0,5.0,USA,AL,Mobile,1984.0,8.0,16.0,...,Aaron,Tommie Lee,190.0,75.0,R,R,1962-04-10,1971-09-26,aarot101,aaronto01
3,aasedo01,1954.0,9.0,8.0,USA,CA,Orange,,,,...,Aase,Donald William,190.0,75.0,R,R,1977-07-26,1990-10-03,aased001,aasedo01
4,abadan01,1972.0,8.0,25.0,USA,FL,Palm Beach,,,,...,Abad,Fausto Andres,184.0,73.0,L,L,2001-09-10,2006-04-13,abada001,abadan01


In [5]:
people.shape

(19370, 24)

## <font color="red"> Exercise 1 </font>

Inspect the types of the `people` table and make note of any necessary changes.

In [6]:
# Your code here

*Your thoughts here*

## <font color="blue"> Key </font>

In [7]:
people.dtypes

playerID         object
birthYear       float64
birthMonth      float64
birthDay        float64
birthCountry     object
birthState       object
birthCity        object
deathYear       float64
deathMonth      float64
deathDay        float64
deathCountry     object
deathState       object
deathCity        object
nameFirst        object
nameLast         object
nameGiven        object
weight          float64
height          float64
bats             object
throws           object
debut            object
finalGame        object
retroID          object
bbrefID          object
dtype: object

1. The columns `debut` and `finalGame` should be `datetime`
2. The birth and death year, month, and day columns should be `Int64`. They hold integers but were read as `float64` because of missing values.
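
The second point follows from how `pandas` stores missing values by default. As a minimal sketch on made-up values (not read from `People.csv`), an integer column with a missing entry is upcast to `float64`, while the nullable `Int64` extension type keeps the integers intact.

```python
import pandas as pd

# Toy data: one known year and one missing value
as_float = pd.Series([1981, None])
as_int = pd.Series([1981, None], dtype='Int64')

print(as_float.dtype)          # float64, because NaN is a float
print(as_int.dtype)            # Int64, the nullable integer extension type
print(as_int.isna().tolist())  # which entries are missing
```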

## Missing `Int64` columns

As mentioned in [Lecture 1/4](./pbpython/notebooks/1_4_more_on_pandas_data_types_key.ipynb), we need to use the most recent version of `pandas` (still in development as of this writing) to allow us to have integer columns with missing values.

In [8]:
assert pd.__version__.startswith('0.24'), "Please uncomment and run the pip command to upgrade pandas"
#!pip install --upgrade pandas
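
Note that the `startswith('0.24')` check above also fails on any release newer than 0.24. A more tolerant check could compare the major and minor version numbers instead. The helper name `supports_nullable_ints` is hypothetical, and the sketch assumes the version string begins with `major.minor`.

```python
import re

import pandas as pd

def supports_nullable_ints(version):
    """Return True if a pandas version string is 0.24 or newer."""
    match = re.match(r'(\d+)\.(\d+)', version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor = (int(part) for part in match.groups())
    return (major, minor) >= (0, 24)

assert supports_nullable_ints(pd.__version__), "Please upgrade pandas to 0.24 or newer"
```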

## Correcting the `pandas` types

1. We pass `parse_dates` a list of date columns
2. We pass `dtype` a `dict` of types for the birth and death columns
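
Before applying both arguments to the full file, it can help to try them on a tiny in-memory CSV. The sketch below uses two rows with values copied from the preview above and only four of the columns; the variable names `csv_text` and `toy` are illustrative.

```python
import io

import pandas as pd

# Two abbreviated rows; the first has no death year
csv_text = """playerID,birthYear,deathYear,debut
aardsda01,1981,,2004-04-06
aaronto01,1939,1984,1962-04-10
"""

toy = pd.read_csv(io.StringIO(csv_text),
                  dtype={'birthYear': pd.Int64Dtype(),
                         'deathYear': pd.Int64Dtype()},
                  parse_dates=['debut'])
print(toy.dtypes)
```

The missing death year survives as a missing value in an `Int64` column rather than forcing the whole column to `float64`.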

#### Constructing the `dtype` `dict`

In [9]:
date_cols = ['debut', 'finalGame']
date_cols

['debut', 'finalGame']

In [12]:
birth_death_date_cols = [prefix + time 
                         for prefix in ('birth', 'death') 
                         for time in ('Year', 'Month', 'Day')]
birth_death_date_cols

['birthYear', 'birthMonth', 'birthDay', 'deathYear', 'deathMonth', 'deathDay']
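
In the comprehension above, the leftmost `for` acts as the outer loop, so all three `birth` names come before the `death` names. As a sketch of an equivalent spelling, `itertools.product` from the standard library yields the same pairs in the same order.

```python
from itertools import product

prefixes = ('birth', 'death')
suffixes = ('Year', 'Month', 'Day')

# Nested comprehension, as in the lab
nested = [prefix + time for prefix in prefixes for time in suffixes]

# Same names from the Cartesian product of prefixes and suffixes
with_product = [prefix + time for prefix, time in product(prefixes, suffixes)]

print(nested == with_product)
```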

In [13]:
people_dtypes = {col:pd.Int64Dtype() for col in people.columns if col in birth_death_date_cols}
people_dtypes

{'birthYear': Int64Dtype(),
 'birthMonth': Int64Dtype(),
 'birthDay': Int64Dtype(),
 'deathYear': Int64Dtype(),
 'deathMonth': Int64Dtype(),
 'deathDay': Int64Dtype()}

## Rereading the csv with the correct types

In [15]:
people = pd.read_csv('~/Desktop/baseball/core/People.csv', 
                     dtype=people_dtypes, 
                     parse_dates=date_cols)
people.dtypes

playerID                object
birthYear                Int64
birthMonth               Int64
birthDay                 Int64
birthCountry            object
birthState              object
birthCity               object
deathYear                Int64
deathMonth               Int64
deathDay                 Int64
deathCountry            object
deathState              object
deathCity               object
nameFirst               object
nameLast                object
nameGiven               object
weight                 float64
height                 float64
bats                    object
throws                  object
debut           datetime64[ns]
finalGame       datetime64[ns]
retroID                 object
bbrefID                 object
dtype: object

In [16]:
people.head()

Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,...,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
0,aardsda01,1981,12,27,USA,CO,Denver,,,,...,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
1,aaronha01,1934,2,5,USA,AL,Mobile,,,,...,Aaron,Henry Louis,180.0,72.0,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
2,aaronto01,1939,8,5,USA,AL,Mobile,1984.0,8.0,16.0,...,Aaron,Tommie Lee,190.0,75.0,R,R,1962-04-10,1971-09-26,aarot101,aaronto01
3,aasedo01,1954,9,8,USA,CA,Orange,,,,...,Aase,Donald William,190.0,75.0,R,R,1977-07-26,1990-10-03,aased001,aasedo01
4,abadan01,1972,8,25,USA,FL,Palm Beach,,,,...,Abad,Fausto Andres,184.0,73.0,L,L,2001-09-10,2006-04-13,abada001,abadan01


## <font color="red"> Exercise 2 </font>
**Goal:** Find a method or attribute of each `dtype`, preferably one that returns something immutable like a `str`, that we can use to identify the general type.

**Tasks:**

1. Pull off an example `dtype`
2. Use `dir` to inspect the available methods
3. Test the methods/attributes to find a good candidate.

In [28]:
# Your code here

## <font color="blue"> Key </font>

In [22]:
from toolz import first
example_type = first(people.dtypes)
example_type

dtype('O')

In [24]:
[m for m in dir(example_type) if not m.startswith('_')]

['alignment',
 'base',
 'byteorder',
 'char',
 'descr',
 'fields',
 'flags',
 'hasobject',
 'isalignedstruct',
 'isbuiltin',
 'isnative',
 'itemsize',
 'kind',
 'metadata',
 'name',
 'names',
 'ndim',
 'newbyteorder',
 'num',
 'shape',
 'str',
 'subdtype',
 'type']

In [27]:
[t.kind for t in people.dtypes]

['O',
 'i',
 'i',
 'i',
 'O',
 'O',
 'O',
 'i',
 'i',
 'i',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'f',
 'f',
 'O',
 'O',
 'M',
 'M',
 'O',
 'O']
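
The `kind` attribute is a one-character code for the general category of a type, and both `numpy` dtypes and the `pandas` nullable integer type expose it. The sketch below builds one example dtype per category; the last two categories do not occur in `people` and are included only as examples of other codes one may meet.

```python
import numpy as np
import pandas as pd

# One example dtype per kind code
example_dtypes = {
    'object': np.dtype(object),
    'int64': np.dtype('int64'),
    'Int64 (nullable)': pd.Int64Dtype(),
    'float64': np.dtype('float64'),
    'datetime64[ns]': np.dtype('datetime64[ns]'),
    'bool': np.dtype(bool),
    'timedelta64[ns]': np.dtype('timedelta64[ns]'),
}
kinds = {label: dtype.kind for label, dtype in example_dtypes.items()}
print(kinds)
```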

## Creating a type conversion dictionary

1. The keys will be the `dtype.kind` strings
2. The values will be the associated `sqlalchemy` types

In [31]:
from sqlalchemy import Integer, Float, String, DateTime
DTYPES_TO_SQLALCHEMY_TYPES = {'O':String,
                              'i':Integer,
                              'f':Float,
                              'M':DateTime}
DTYPES_TO_SQLALCHEMY_TYPES

{'O': sqlalchemy.sql.sqltypes.String,
 'i': sqlalchemy.sql.sqltypes.Integer,
 'f': sqlalchemy.sql.sqltypes.Float,
 'M': sqlalchemy.sql.sqltypes.DateTime}
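
The dictionary above covers only the four kinds present in `people`. A `DataFrame` with another kind, such as `bool` (`'b'`) or `timedelta64` (`'m'`), would raise a bare `KeyError` on lookup. One possible way to make that failure easier to diagnose is a small lookup helper. The name `sql_type_for` is hypothetical and not part of `pandas` or `sqlalchemy`.

```python
import numpy as np
from sqlalchemy import DateTime, Float, Integer, String

DTYPES_TO_SQLALCHEMY_TYPES = {'O': String,
                              'i': Integer,
                              'f': Float,
                              'M': DateTime}

def sql_type_for(dtype):
    """Return the sqlalchemy type for a dtype, or raise a descriptive error."""
    try:
        return DTYPES_TO_SQLALCHEMY_TYPES[dtype.kind]
    except KeyError:
        raise TypeError(f"No sqlalchemy type registered for {dtype!r} "
                        f"(kind {dtype.kind!r})") from None

print(sql_type_for(np.dtype('float64')))
```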

## Use ALL CAPS for global constants

When dealing with global constants, we should

1. Define them at the top of the file.
2. Use an ALL CAPS name to make them stand out.

## <font color="red"> Exercise 3 </font>

Write a `dict` comprehension that uses our conversion `dict` to convert the `pandas` `dtypes` to `sqlalchemy` types.

In [35]:
# Your code here

## <font color="blue"> Key </font>

In [33]:
from toolz import get
{col:get(dtype.kind, DTYPES_TO_SQLALCHEMY_TYPES) 
 for col, dtype in zip(people.columns, people.dtypes)}

{'playerID': sqlalchemy.sql.sqltypes.String,
 'birthYear': sqlalchemy.sql.sqltypes.Integer,
 'birthMonth': sqlalchemy.sql.sqltypes.Integer,
 'birthDay': sqlalchemy.sql.sqltypes.Integer,
 'birthCountry': sqlalchemy.sql.sqltypes.String,
 'birthState': sqlalchemy.sql.sqltypes.String,
 'birthCity': sqlalchemy.sql.sqltypes.String,
 'deathYear': sqlalchemy.sql.sqltypes.Integer,
 'deathMonth': sqlalchemy.sql.sqltypes.Integer,
 'deathDay': sqlalchemy.sql.sqltypes.Integer,
 'deathCountry': sqlalchemy.sql.sqltypes.String,
 'deathState': sqlalchemy.sql.sqltypes.String,
 'deathCity': sqlalchemy.sql.sqltypes.String,
 'nameFirst': sqlalchemy.sql.sqltypes.String,
 'nameLast': sqlalchemy.sql.sqltypes.String,
 'nameGiven': sqlalchemy.sql.sqltypes.String,
 'weight': sqlalchemy.sql.sqltypes.Float,
 'height': sqlalchemy.sql.sqltypes.Float,
 'bats': sqlalchemy.sql.sqltypes.String,
 'throws': sqlalchemy.sql.sqltypes.String,
 'debut': sqlalchemy.sql.sqltypes.DateTime,
 'finalGame': sqlalchemy.sql.sqltypes.Da

## <font color="red"> Exercise 4 </font>

Package your expression in a `lambda` and refactor your code by adding helper functions to clean up the expression.

In [None]:
# lambda function before refactoring

In [37]:
# refactored lambda function and helper functions

## <font color="blue"> Key </font>

In [38]:
get_sql_types = lambda df: {col:DTYPES_TO_SQLALCHEMY_TYPES[dtype.kind]
                            for col, dtype in zip(df.columns, df.dtypes)}

In [35]:
def get_sql_types(df):
    sql_type = lambda dtype: DTYPES_TO_SQLALCHEMY_TYPES[dtype.kind] 
    cols_and_dtypes = lambda df: zip(df.columns, df.dtypes)
    return {col:sql_type(dtype) 
            for col, dtype in cols_and_dtypes(df)}
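
An equivalent formulation, shown here as a sketch on a made-up two-row frame, iterates over `df.dtypes.items()`, which yields the same `(column, dtype)` pairs as `zip(df.columns, df.dtypes)`. The function name `sql_types_from_items` and the frame `toy` are hypothetical.

```python
import pandas as pd
from sqlalchemy import DateTime, Float, Integer, String

DTYPES_TO_SQLALCHEMY_TYPES = {'O': String,
                              'i': Integer,
                              'f': Float,
                              'M': DateTime}

def sql_types_from_items(df):
    """Map each column name to the sqlalchemy type for its dtype kind."""
    return {col: DTYPES_TO_SQLALCHEMY_TYPES[dtype.kind]
            for col, dtype in df.dtypes.items()}

# Toy frame with one column per kind used in the lab
toy = pd.DataFrame({
    'playerID': pd.Series(['aardsda01', 'aaronha01'], dtype=object),
    'birthYear': pd.array([1981, 1934], dtype='Int64'),
    'weight': [215.0, 180.0],
    'debut': pd.to_datetime(['2004-04-06', '1954-04-13']),
})
print(sql_types_from_items(toy))
```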

In [36]:
get_sql_types(people)

{'playerID': sqlalchemy.sql.sqltypes.String,
 'birthYear': sqlalchemy.sql.sqltypes.Integer,
 'birthMonth': sqlalchemy.sql.sqltypes.Integer,
 'birthDay': sqlalchemy.sql.sqltypes.Integer,
 'birthCountry': sqlalchemy.sql.sqltypes.String,
 'birthState': sqlalchemy.sql.sqltypes.String,
 'birthCity': sqlalchemy.sql.sqltypes.String,
 'deathYear': sqlalchemy.sql.sqltypes.Integer,
 'deathMonth': sqlalchemy.sql.sqltypes.Integer,
 'deathDay': sqlalchemy.sql.sqltypes.Integer,
 'deathCountry': sqlalchemy.sql.sqltypes.String,
 'deathState': sqlalchemy.sql.sqltypes.String,
 'deathCity': sqlalchemy.sql.sqltypes.String,
 'nameFirst': sqlalchemy.sql.sqltypes.String,
 'nameLast': sqlalchemy.sql.sqltypes.String,
 'nameGiven': sqlalchemy.sql.sqltypes.String,
 'weight': sqlalchemy.sql.sqltypes.Float,
 'height': sqlalchemy.sql.sqltypes.Float,
 'bats': sqlalchemy.sql.sqltypes.String,
 'throws': sqlalchemy.sql.sqltypes.String,
 'debut': sqlalchemy.sql.sqltypes.DateTime,
 'finalGame': sqlalchemy.sql.sqltypes.Da

## <font color="red"> Exercise 5 </font>

Add the `People.csv` data to your `baseball.db` database.

In [None]:
# Your code here

## <font color="blue"> Key </font>

#### Creating the `sqlalchemy` type `dict`

In [44]:
sql_types = get_sql_types(people)
sql_types

{'playerID': sqlalchemy.sql.sqltypes.String,
 'birthYear': sqlalchemy.sql.sqltypes.Integer,
 'birthMonth': sqlalchemy.sql.sqltypes.Integer,
 'birthDay': sqlalchemy.sql.sqltypes.Integer,
 'birthCountry': sqlalchemy.sql.sqltypes.String,
 'birthState': sqlalchemy.sql.sqltypes.String,
 'birthCity': sqlalchemy.sql.sqltypes.String,
 'deathYear': sqlalchemy.sql.sqltypes.Integer,
 'deathMonth': sqlalchemy.sql.sqltypes.Integer,
 'deathDay': sqlalchemy.sql.sqltypes.Integer,
 'deathCountry': sqlalchemy.sql.sqltypes.String,
 'deathState': sqlalchemy.sql.sqltypes.String,
 'deathCity': sqlalchemy.sql.sqltypes.String,
 'nameFirst': sqlalchemy.sql.sqltypes.String,
 'nameLast': sqlalchemy.sql.sqltypes.String,
 'nameGiven': sqlalchemy.sql.sqltypes.String,
 'weight': sqlalchemy.sql.sqltypes.Float,
 'height': sqlalchemy.sql.sqltypes.Float,
 'bats': sqlalchemy.sql.sqltypes.String,
 'throws': sqlalchemy.sql.sqltypes.String,
 'debut': sqlalchemy.sql.sqltypes.DateTime,
 'finalGame': sqlalchemy.sql.sqltypes.Da

#### Checking if `playerID` is a unique primary key

In [45]:
people.playerID.is_unique

True
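
Note that `is_unique` alone does not rule out a single missing value, while the schema below declares `playerID` as `NOT NULL`. A slightly stricter check, sketched here on made-up data with a hypothetical helper name, tests both conditions.

```python
import pandas as pd

def is_primary_key_candidate(column):
    """True when every value is present and no value repeats."""
    return bool(column.notna().all() and column.is_unique)

good = pd.Series(['aardsda01', 'aaronha01', 'aaronto01'])
repeated = pd.Series(['aardsda01', 'aardsda01'])
missing = pd.Series(['aardsda01', None])

print(is_primary_key_candidate(good),
      is_primary_key_candidate(repeated),
      is_primary_key_candidate(missing))
```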

#### Creating an `engine` and `schema`

In [47]:
from sqlalchemy import create_engine
mang_eng = create_engine("sqlite:///databases/baseball.db")
mang_eng.echo = False
schema = pd.io.sql.get_schema(people, 'people', keys='playerID', con=mang_eng, dtype=sql_types)
print(schema)


CREATE TABLE people (
	"playerID" VARCHAR NOT NULL, 
	"birthYear" INTEGER, 
	"birthMonth" INTEGER, 
	"birthDay" INTEGER, 
	"birthCountry" VARCHAR, 
	"birthState" VARCHAR, 
	"birthCity" VARCHAR, 
	"deathYear" INTEGER, 
	"deathMonth" INTEGER, 
	"deathDay" INTEGER, 
	"deathCountry" VARCHAR, 
	"deathState" VARCHAR, 
	"deathCity" VARCHAR, 
	"nameFirst" VARCHAR, 
	"nameLast" VARCHAR, 
	"nameGiven" VARCHAR, 
	weight FLOAT, 
	height FLOAT, 
	bats VARCHAR, 
	throws VARCHAR, 
	debut DATETIME, 
	"finalGame" DATETIME, 
	"retroID" VARCHAR, 
	"bbrefID" VARCHAR, 
	CONSTRAINT people_pk PRIMARY KEY ("playerID")
)




In [48]:
mang_eng.execute(schema)

<sqlalchemy.engine.result.ResultProxy at 0x119066828>

In [49]:
people.to_sql('people', 
                con=mang_eng, 
                dtype=sql_types, 
                index=False,
                if_exists='append')
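
After an append, it is worth confirming that the number of rows in the table matches the `DataFrame`. The sketch below assumes nothing about `baseball.db`: it uses an in-memory SQLite database through the standard-library `sqlite3` module and a made-up two-row table with a simplified schema.

```python
import sqlite3

import pandas as pd

toy = pd.DataFrame({'playerID': ['aardsda01', 'aaronha01'],
                    'birthYear': [1981, 1934]})

con = sqlite3.connect(':memory:')
# Create the table first, as in the lab, so the primary key is declared
con.execute('CREATE TABLE people ('
            '"playerID" VARCHAR NOT NULL PRIMARY KEY, '
            '"birthYear" INTEGER)')
toy.to_sql('people', con=con, index=False, if_exists='append')

# Count the rows that actually landed in the table
row_count = pd.read_sql_query('SELECT COUNT(*) AS n FROM people', con)['n'].iloc[0]
print(row_count == len(toy))
con.close()
```

Because the table has a primary key, running the append a second time against the same table would be rejected by the database rather than silently duplicating rows.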

In [50]:
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import sessionmaker
from sqlalchemy import select

# Open a fresh engine on the same file and bind the session to it
mang_eng2 = create_engine("sqlite:///databases/baseball.db")
Session = sessionmaker(bind=mang_eng2)
session = Session()

In [51]:
Base = automap_base()
Base.prepare(mang_eng2, reflect=True)
People = Base.classes.people

In [53]:
from more_sqlalchemy import result_dicts
stmt = select('*').select_from(People)
session.execute(stmt).fetchmany(5) >> result_dicts

[{'playerID': 'aardsda01',
  'birthYear': 1981,
  'birthMonth': 12,
  'birthDay': 27,
  'birthCountry': 'USA',
  'birthState': 'CO',
  'birthCity': 'Denver',
  'deathYear': None,
  'deathMonth': None,
  'deathDay': None,
  'deathCountry': None,
  'deathState': None,
  'deathCity': None,
  'nameFirst': 'David',
  'nameLast': 'Aardsma',
  'nameGiven': 'David Allan',
  'weight': 215.0,
  'height': 75.0,
  'bats': 'R',
  'throws': 'R',
  'debut': '2004-04-06 00:00:00.000000',
  'finalGame': '2015-08-23 00:00:00.000000',
  'retroID': 'aardd001',
  'bbrefID': 'aardsda01'},
 {'playerID': 'aaronha01',
  'birthYear': 1934,
  'birthMonth': 2,
  'birthDay': 5,
  'birthCountry': 'USA',
  'birthState': 'AL',
  'birthCity': 'Mobile',
  'deathYear': None,
  'deathMonth': None,
  'deathDay': None,
  'deathCountry': None,
  'deathState': None,
  'deathCity': None,
  'nameFirst': 'Hank',
  'nameLast': 'Aaron',
  'nameGiven': 'Henry Louis',
  'weight': 180.0,
  'height': 72.0,
  'bats': 'R',
  'throws': 

## <font color="red"> Exercise 6 </font>

Set up a similar automatic conversion process for `pyspark` `DataFrame`s.  You need to build a structure like this:

```python
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, IntegerType)

schema = StructType([StructField('Name', StringType(), True),
                     StructField('DateTime', TimestampType(), True),
                     StructField('Age', IntegerType(), True)])
```