# Big Data
How can we use `Boto` (the `boto3` AWS SDK for Python) to read large datasets from S3?

## `Boto`

### Grab `Churn_Modelling.csv` from AWS S3

In [2]:
import pandas as pd
import boto3

# AWS credentials must be configured (for example with `aws configure`) before reading from S3

bucket = "make-school-data"
file_name = "data/Churn_Modelling.csv"

# create connection to S3 using default config 
s3 = boto3.client("s3")

# fetch the object; obj["Body"] is a file-like stream that pandas can read directly
obj = s3.get_object(Bucket=bucket, Key=file_name)

df = pd.read_csv(obj["Body"])
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0
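
When a file is too large to load in one go, one option is to have pandas read the body in chunks. This is a minimal sketch, assuming the S3 body can be consumed like a binary file, as in the cell above. An in-memory `io.BytesIO` stands in for `obj["Body"]` so the example runs without AWS access. The helper name `count_exited`, the sample rows, and the chunk size are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd


def count_exited(body, chunksize=2):
    """Sum the Exited column while holding at most `chunksize` rows in memory."""
    total = 0
    # With chunksize set, read_csv yields DataFrames instead of one big frame
    for chunk in pd.read_csv(body, chunksize=chunksize):
        total += int(chunk["Exited"].sum())
    return total


# Stand-in for obj["Body"]: a few rows shaped like the churn data (assumed sample)
sample = io.BytesIO(
    b"CustomerId,Geography,Exited\n"
    b"15634602,France,1\n"
    b"15647311,Spain,0\n"
    b"15619304,France,1\n"
)
exited_total = count_exited(sample)
print(exited_total)
```

With the real object, the call would look like `count_exited(obj["Body"], chunksize=100_000)`. The right chunk size depends on available memory.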


## SQL Practice

In [44]:
import sqlite3 as lite

con = lite.connect('population.db')

with con:
    cur = con.cursor()
    cur.execute("CREATE TABLE Population(id INTEGER PRIMARY KEY, country TEXT, population INT)")
    cur.execute("INSERT INTO Population VALUES(NULL,'Germany',81197537)")
    cur.execute("INSERT INTO Population VALUES(NULL,'France', 66415161)")
    cur.execute("INSERT INTO Population VALUES(NULL,'Spain', 46439864)")
    cur.execute("INSERT INTO Population VALUES(NULL,'Italy', 60795612)")
    cur.execute("INSERT INTO Population VALUES(NULL,'Spain', 46439864)")
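
The repeated `INSERT` statements can also be written as one `executemany` call with `?` placeholders. This is a sketch under stated assumptions rather than the notebook's own method: it uses an in-memory database instead of `population.db` and adds `IF NOT EXISTS` so the cell can be re-run without an error.

```python
import sqlite3

# Rows to insert; the values mirror the cell above (Spain appears twice there too)
rows = [
    ("Germany", 81197537),
    ("France", 66415161),
    ("Spain", 46439864),
    ("Italy", 60795612),
    ("Spain", 46439864),
]

con = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
with con:  # commits on success, rolls back on error
    con.execute(
        "CREATE TABLE IF NOT EXISTS Population("
        "id INTEGER PRIMARY KEY, country TEXT, population INT)"
    )
    # "?" placeholders let sqlite3 bind the values instead of formatting strings
    con.executemany(
        "INSERT INTO Population(country, population) VALUES (?, ?)", rows
    )

row_count = con.execute("SELECT COUNT(*) FROM Population").fetchone()[0]
print(row_count)
```

Naming the columns in the `INSERT` lets SQLite assign `id` itself, so the explicit `NULL` is no longer needed.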

In [46]:
# import os
# os.remove("population.db")

# cur.execute("DROP TABLE Population")

Write a SQL query in Python that returns all records where the population field is greater than or equal to 50M

In [45]:
import pandas as pd
import sqlite3

conn = sqlite3.connect('population.db')
query = "SELECT country FROM Population WHERE population >= 50000000;"

df = pd.read_sql_query(query, conn)

for country in df['country']:
    print(country)

Germany
France
Italy
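
The threshold does not have to be hard-coded into the SQL string. Here is a small sketch that binds it as a parameter, assuming the same table layout as above; the in-memory database and the helper name `countries_at_least` are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Population(id INTEGER PRIMARY KEY, country TEXT, population INT)"
)
conn.executemany(
    "INSERT INTO Population(country, population) VALUES (?, ?)",
    [("Germany", 81197537), ("France", 66415161),
     ("Spain", 46439864), ("Italy", 60795612)],
)


def countries_at_least(conn, min_population):
    """Return country names with population >= min_population, in insert order."""
    query = "SELECT country FROM Population WHERE population >= ? ORDER BY id"
    # The threshold is passed separately and bound to the "?" placeholder
    return [row[0] for row in conn.execute(query, (min_population,))]


big = countries_at_least(conn, 50_000_000)
print(big)
```

The same placeholder style can be passed through pandas with the `params` argument of `read_sql_query`.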


Write a SQL query in Python that returns all countries whose names start with 'S'

In [50]:
query = "SELECT country FROM Population WHERE country LIKE 'S%'"
df = pd.read_sql_query(query, conn)

for country in df['country']:
    print(country)

Spain
Spain
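
Spain is printed twice because the table holds two Spain rows. If each matching country should appear only once, `SELECT DISTINCT` is one way to do it. This is a sketch on an in-memory copy of a few rows, not the original `population.db`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Population(id INTEGER PRIMARY KEY, country TEXT, population INT)"
)
conn.executemany(
    "INSERT INTO Population(country, population) VALUES (?, ?)",
    [("France", 66415161), ("Spain", 46439864), ("Spain", 46439864)],
)

pattern = "S%"  # "%" matches any run of characters after the leading S

# Plain LIKE returns one result per matching row, duplicates included
all_matches = [
    row[0]
    for row in conn.execute(
        "SELECT country FROM Population WHERE country LIKE ?", (pattern,)
    )
]

# DISTINCT collapses identical result rows
unique_matches = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT country FROM Population WHERE country LIKE ?", (pattern,)
    )
]

print(all_matches, unique_matches)
```

Note that SQLite's `LIKE` is case-insensitive for ASCII letters by default, so `'S%'` would also match a name stored in lowercase.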


## NoSQL

1. What is the counterpart of a table in a document-oriented NoSQL database? -- Collection
2. What is the counterpart of a record (row)? -- Document
3. What data structure represents a document? -- A dictionary of key-value pairs (JSON-like)
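
The three answers above can be made concrete in plain Python. This is an illustration only, with no NoSQL database involved: a list stands in for a collection, and each document is a dictionary. The `_id` field follows a common document-store convention, and the `find` helper is a hypothetical name for this sketch.

```python
import json

# A "collection" modeled as a list of documents (plain dicts).
# Documents in the same collection do not have to share the same fields.
population = [
    {"_id": 1, "country": "Germany", "population": 81197537},
    {"_id": 2, "country": "France", "population": 66415161, "capital": "Paris"},
]


def find(collection, **criteria):
    """Return the documents whose fields equal every given criterion."""
    return [
        doc
        for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


matches = find(population, country="France")
# A document serializes naturally to JSON, the usual exchange format
print(json.dumps(matches[0], indent=2))
```

Unlike the SQL table earlier, nothing forces every document to have a `capital` field. That flexibility is the main practical difference from a fixed table schema.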

## APIs

Get the first 10 repositories returned by a GitHub search for "tensorflow"

In [64]:
import requests

url = "https://api.github.com/search/repositories?q=tensorflow"

r = requests.get(url).json()

for repo in r["items"][:10]:
    print(repo["full_name"])

tensorflow/tensorflow
romeokienzler/TensorFlow
aymericdamien/TensorFlow-Examples
czy36mengfei/tensorflow2_tutorials_chinese
jikexueyuanwiki/tensorflow-zh
jtoy/awesome-tensorflow
tensorflow/models
yao62995/tensorflow
lyhue1991/eat_tensorflow2_in_30_days
tensorflow/docs
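
The request above relies on GitHub's default ordering and does not check for HTTP errors. The sketch below is one possible variant, passing the query through `params`, asking for results sorted by stars, and raising on a failed response. The helper names are hypothetical, and the sample payload is hand-written in the response's shape so the parsing can be tried offline.

```python
def build_search_params(term, limit=10):
    """Query parameters for GitHub's repository search endpoint."""
    return {"q": term, "sort": "stars", "order": "desc", "per_page": limit}


def repo_names(payload, limit=10):
    """Pull full_name out of a search response, tolerating a missing items key."""
    return [repo["full_name"] for repo in payload.get("items", [])[:limit]]


def search_repositories(term, limit=10):
    """Run the search over the network and return repository names."""
    import requests  # imported lazily so the helpers above work offline

    response = requests.get(
        "https://api.github.com/search/repositories",
        params=build_search_params(term, limit),
        timeout=10,
    )
    response.raise_for_status()  # fail loudly on rate limits or other HTTP errors
    return repo_names(response.json(), limit)


# Offline check with a hand-written payload shaped like the API response
sample_payload = {
    "items": [
        {"full_name": "tensorflow/tensorflow"},
        {"full_name": "tensorflow/models"},
    ]
}
names = repo_names(sample_payload, limit=1)
print(names)
```

Calling `search_repositories("tensorflow")` would perform the live request. Unauthenticated requests are rate limited, which is why the status check is worth having.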


In [69]:
r["items"][0].keys()

dict_keys(['id', 'node_id', 'name', 'full_name', 'private', 'owner', 'html_url', 'description', 'fork', 'url', 'forks_url', 'keys_url', 'collaborators_url', 'teams_url', 'hooks_url', 'issue_events_url', 'events_url', 'assignees_url', 'branches_url', 'tags_url', 'blobs_url', 'git_tags_url', 'git_refs_url', 'trees_url', 'statuses_url', 'languages_url', 'stargazers_url', 'contributors_url', 'subscribers_url', 'subscription_url', 'commits_url', 'git_commits_url', 'comments_url', 'issue_comment_url', 'contents_url', 'compare_url', 'merges_url', 'archive_url', 'downloads_url', 'issues_url', 'pulls_url', 'milestones_url', 'notifications_url', 'labels_url', 'releases_url', 'deployments_url', 'created_at', 'updated_at', 'pushed_at', 'git_url', 'ssh_url', 'clone_url', 'svn_url', 'homepage', 'size', 'stargazers_count', 'watchers_count', 'language', 'has_issues', 'has_projects', 'has_downloads', 'has_wiki', 'has_pages', 'forks_count', 'mirror_url', 'archived', 'disabled', 'open_issues_count', 'lic