 <font color='blue'> Date: 20210314

 <font color='blue'> POC: Siyu Liu

 <font color='blue'>Updates:

* Added Netflix, Disney+, Hulu, HBO Max, and platform page info scripted data to the db.
* Did not include Twitter, as its schema is still subject to change.



## Instructions
#### Install sqlite
Download and install SQLite from https://www.sqlite.org/download.html

or, if you are using a conda environment, run the command: conda install -c anaconda sqlite

#### Note:
##### DB naming convention:
##### project_s + summary + 'v' + version + timestamp (YYYYMMDD), joined with underscores
    

e.g.
'project_s_sample_v0.0_20210221'
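
The convention above can be captured in a small helper. This is a minimal sketch: `build_db_name` is a hypothetical name for illustration and is not part of `db_utils`, and the `.db` suffix is taken from the file name used in the Config cell.

```python
from datetime import date


def build_db_name(summary: str, version: str, day: date) -> str:
    """Build a db name as project_s_<summary>_v<version>_<YYYYMMDD>.

    Hypothetical helper for illustration; not part of db_utils.
    """
    return f"project_s_{summary}_v{version}_{day:%Y%m%d}"


example_name = build_db_name("sample", "0.0", date(2021, 2, 21))
print(example_name)          # project_s_sample_v0.0_20210221
print(example_name + ".db")  # same name with the SQLite file extension
```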

In [1]:
import sqlite3
import sys
from pathlib import Path

import pandas as pd

# make the project helpers in ../utils/ importable
sys.path.insert(1, '../utils/')
from db_utils import load_data

In [2]:
!ls ../utils/

db_utils.py  dtypes_utils.py  __pycache__


# Config

In [3]:
db_name = 'project_s_instagram_v1_20210314.db'
conn = sqlite3.connect(db_name)
c = conn.cursor()

# Platform I. Netflix

##  <font color='blue'> Step 1 data loading

In [4]:
netflix_address = '../data/instagram_netflix'

netflix_static, netflix_tracking = load_data(netflix_address)

loading data......
netflix_post_general_tracking_info.csv
netflix_post_general_static_info.csv


##  <font color='blue'> Step 2 write to db

In [5]:
## Create DB

table_name_netflix_static = 'instagram_netflix_static'
table_name_netflix_tracking = 'instagram_netflix_tracking'

# ensure the db file exists (touch() only updates the modification time if it already does)
Path(db_name).touch()

In [6]:
### write static data

# initialize schema
c.execute('''CREATE TABLE IF NOT EXISTS {table_name}
                         (fetch_date timestamp,
                          hrefs varchar,
                          short_codes varchar,
                          post_types varchar,
                          captions varchar,
                          accessibility_caption varchar,
                          id int,
                          timestamp int,
                          upload_date timestamp,
                          upload_date_text varchar
                          )'''.format(table_name = table_name_netflix_static))

# write csv to table
netflix_static.to_sql(table_name_netflix_static, 
                      conn, 
                      if_exists='append',
                      index=False)

In [7]:
### write tracking data

# initialize schema
c.execute('''CREATE TABLE IF NOT EXISTS {table_name}
                         (fetch_date timestamp,
                          short_codes varchar,
                          number_of_likes int,
                          number_of_video_views int,
                          number_of_comments int
                          )'''.format(table_name = table_name_netflix_tracking))

# write csv to table
netflix_tracking.to_sql(table_name_netflix_tracking, 
                      conn, 
                      if_exists='append',
                      index=False)

##  <font color='blue'> Step 3 Usage

### Note: To standardize the data format, check out the demo in the ~/demo/ folder
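
The demo itself is not reproduced here. As one illustration of the kind of standardization that may be useful, the `fetch_date` values in this notebook's outputs look like `03/06/21` (MM/DD/YY), which does not sort chronologically as text. A minimal sketch, assuming that format, converts such strings to ISO 8601 dates; `to_iso_date` is a hypothetical helper and is not taken from the demo or from `db_utils`.

```python
from datetime import datetime


def to_iso_date(text: str) -> str:
    """Convert an MM/DD/YY string (assumed input format) to YYYY-MM-DD."""
    return datetime.strptime(text, "%m/%d/%y").date().isoformat()


print(to_iso_date("03/06/21"))  # 2021-03-06
```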

In [8]:
### querying from the db
q = """
    SELECT * FROM instagram_netflix_tracking a
    JOIN instagram_netflix_static b
    ON a.short_codes = b.short_codes
    """
netflix = pd.read_sql(q, conn)
netflix.head(2)

Unnamed: 0,fetch_date,short_codes,number_of_likes,number_of_video_views,number_of_comments,fetch_date.1,hrefs,short_codes.1,post_types,captions,accessibility_caption,id,timestamp,upload_date,upload_date_text
0,03/06/21,CL4c2Fjl0tq,768179,,8548,03/06/21,https://www.instagram.com/p/CL4c2Fjl0tq/,CL4c2Fjl0tq,Image,"Photo by Netflix US on March 01, 2021. May be ...","Photo by Netflix US on March 01, 2021. May be ...",2519890853633674090,1614614420,2021-03-01T16:00:20.000Z,"Mar 1, 2021"
1,03/06/21,CL4c2Fjl0tq,768179,,8548,03/06/21,https://www.instagram.com/p/CL4c2Fjl0tq/,CL4c2Fjl0tq,Image,"Photo by Netflix US on March 01, 2021. May be ...","Photo by Netflix US on March 01, 2021. May be ...",2519890853633674090,1614614420,2021-03-01T16:00:20.000Z,"Mar 1, 2021"
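
The two rows above are identical. One possible cause (an assumption, not confirmed by the notebook) is that the write cells were run more than once, since `if_exists='append'` adds the same rows again on every run. A minimal sketch of one way to make repeated writes harmless, using an in-memory database and a hypothetical table `tracking_demo` with a uniqueness constraint:

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")  # throwaway db for illustration
demo_conn.execute(
    """CREATE TABLE IF NOT EXISTS tracking_demo
           (fetch_date timestamp,
            short_codes varchar,
            number_of_likes int,
            UNIQUE (fetch_date, short_codes))"""
)

rows = [("03/06/21", "CL4c2Fjl0tq", 768179)]

# Simulate running the write step twice; the second insert is ignored
for _ in range(2):
    demo_conn.executemany(
        "INSERT OR IGNORE INTO tracking_demo VALUES (?, ?, ?)", rows
    )
demo_conn.commit()

row_count = demo_conn.execute("SELECT COUNT(*) FROM tracking_demo").fetchone()[0]
print(row_count)  # 1
```

When writing with `DataFrame.to_sql`, an alternative may be to drop duplicate rows from the DataFrame before appending.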


# Platform II. Disney+

In [9]:
disney_address = '../data/instagram_disney'

disney_static, disney_tracking = load_data(disney_address)

loading data......
disney+_post_general_static_info.csv
disney+_post_general_tracking_info.csv


In [10]:
## Create table

table_name_disney_static = 'instagram_disney_static'
table_name_disney_tracking = 'instagram_disney_tracking'

# ensure the db file exists (touch() only updates the modification time if it already does)
Path(db_name).touch()

### write static data

# initialize schema
c.execute('''CREATE TABLE IF NOT EXISTS {table_name}
                         (fetch_date timestamp,
                          hrefs varchar,
                          short_codes varchar,
                          post_types varchar,
                          captions varchar,
                          accessibility_caption varchar,
                          id int,
                          timestamp int,
                          upload_date timestamp,
                          upload_date_text varchar
                          )'''.format(table_name = table_name_disney_static))

# write csv to table
disney_static.to_sql(table_name_disney_static, 
                      conn, 
                      if_exists='append',
                      index=False)

### write tracking data

# initialize schema
c.execute('''CREATE TABLE IF NOT EXISTS {table_name}
                         (fetch_date timestamp,
                          short_codes varchar,
                          number_of_likes int,
                          number_of_video_views int,
                          number_of_comments int
                          )'''.format(table_name = table_name_disney_tracking))

# write csv to table
disney_tracking.to_sql(table_name_disney_tracking, 
                      conn, 
                      if_exists='append',
                      index=False)
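
The static and tracking schemas used for Disney+ are the same as those used for Netflix, so the table creation could be driven by a loop instead of being repeated per platform. A minimal sketch under that assumption, run against an in-memory database; `STATIC_SCHEMA`, `TRACKING_SCHEMA`, and `create_platform_tables` are hypothetical names, not part of `db_utils`.

```python
import sqlite3

STATIC_SCHEMA = """(fetch_date timestamp,
                    hrefs varchar,
                    short_codes varchar,
                    post_types varchar,
                    captions varchar,
                    accessibility_caption varchar,
                    id int,
                    timestamp int,
                    upload_date timestamp,
                    upload_date_text varchar)"""

TRACKING_SCHEMA = """(fetch_date timestamp,
                      short_codes varchar,
                      number_of_likes int,
                      number_of_video_views int,
                      number_of_comments int)"""


def create_platform_tables(conn, platform):
    """Create instagram_<platform>_static and _tracking if they are missing."""
    # Table names cannot be bound as SQL parameters, so they are formatted in;
    # only pass trusted, hard-coded platform names.
    for kind, schema in (("static", STATIC_SCHEMA), ("tracking", TRACKING_SCHEMA)):
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS instagram_{platform}_{kind} {schema}"
        )


demo_conn = sqlite3.connect(":memory:")  # throwaway db for illustration
for platform in ("netflix", "disney", "hulu", "hbomax"):
    create_platform_tables(demo_conn, platform)

tables = sorted(
    name for (name,) in demo_conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )
)
print(tables)
```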

In [11]:
### querying from the db
q = """
    SELECT * FROM instagram_disney_tracking a
    JOIN instagram_disney_static b
    ON a.short_codes = b.short_codes
    """
disney = pd.read_sql(q, conn)
disney.head(2)

Unnamed: 0,fetch_date,short_codes,number_of_likes,number_of_video_views,number_of_comments,fetch_date.1,hrefs,short_codes.1,post_types,captions,accessibility_caption,id,timestamp,upload_date,upload_date_text
0,03/11/21,CMLw1gys2bz,52399,160923.0,162,03/11/21,https://www.instagram.com/p/CMLw1gys2bz/,CMLw1gys2bz,Video,Go back to the beginning and experience their ...,,2525326799646451443,1615262459,2021-03-09T04:00:59.000Z,"Mar 8, 2021"
1,03/11/21,CMLw1gys2bz,52399,160923.0,162,03/11/21,https://www.instagram.com/p/CMLw1gys2bz/,CMLw1gys2bz,Video,Go back to the beginning and experience their ...,,2525326799646451443,1615262459,2021-03-09T04:00:59.000Z,"Mar 8, 2021"


# Platform III. Hulu

In [12]:
hulu_address = '../data/instagram_hulu'

hulu_static, hulu_tracking = load_data(hulu_address)

loading data......
hulu_post_general_static_info.csv
hulu_post_general_tracking_info.csv


In [13]:
## Create table

table_name_hulu_static = 'instagram_hulu_static'
table_name_hulu_tracking = 'instagram_hulu_tracking'

# ensure the db file exists (touch() only updates the modification time if it already does)
Path(db_name).touch()

### write static data

# initialize schema
c.execute('''CREATE TABLE IF NOT EXISTS {table_name}
                         (fetch_date timestamp,
                          hrefs varchar,
                          short_codes varchar,
                          post_types varchar,
                          captions varchar,
                          accessibility_caption varchar,
                          id int,
                          timestamp int,
                          upload_date timestamp,
                          upload_date_text varchar
                          )'''.format(table_name = table_name_hulu_static))

# write csv to table
hulu_static.to_sql(table_name_hulu_static, 
                      conn, 
                      if_exists='append',
                      index=False)

### write tracking data

# initialize schema
c.execute('''CREATE TABLE IF NOT EXISTS {table_name}
                         (fetch_date timestamp,
                          short_codes varchar,
                          number_of_likes int,
                          number_of_video_views int,
                          number_of_comments int
                          )'''.format(table_name = table_name_hulu_tracking))

# write csv to table
hulu_tracking.to_sql(table_name_hulu_tracking, 
                      conn, 
                      if_exists='append',
                      index=False)

In [14]:
### querying from the db
q = """
    SELECT * FROM instagram_hulu_tracking a
    JOIN instagram_hulu_static b
    ON a.short_codes = b.short_codes
    """
hulu = pd.read_sql(q, conn)
hulu.head(2)

Unnamed: 0,fetch_date,short_codes,number_of_likes,number_of_video_views,number_of_comments,fetch_date.1,hrefs,short_codes.1,post_types,captions,accessibility_caption,id,timestamp,upload_date,upload_date_text
0,03/11/21,CMP44ARBOnZ,1137,,12,03/11/21,https://www.instagram.com/p/CMP44ARBOnZ/,CMP44ARBOnZ,Carousel,"Photo shared by HuluStreamKween90 on March 10,...","Photo shared by HuluStreamKween90 on March 10,...",2526488055158991321,1615400867,2021-03-10T18:27:47.000Z,"Mar 10, 2021"
1,03/11/21,CMP44ARBOnZ,1137,,12,03/11/21,https://www.instagram.com/p/CMP44ARBOnZ/,CMP44ARBOnZ,Carousel,"Photo shared by HuluStreamKween90 on March 10,...","Photo shared by HuluStreamKween90 on March 10,...",2526488055158991321,1615400867,2021-03-10T18:27:47.000Z,"Mar 10, 2021"
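
Because `SELECT *` returns every column from both tables, the joined result carries `fetch_date` and `short_codes` twice. A minimal sketch of listing columns explicitly so that each name appears once, shown on an in-memory database with hypothetical trimmed-down tables `tracking_demo` and `static_demo` that reuse a few of the column names above:

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")  # throwaway db for illustration
demo_conn.execute(
    "CREATE TABLE tracking_demo "
    "(fetch_date timestamp, short_codes varchar, number_of_likes int)"
)
demo_conn.execute(
    "CREATE TABLE static_demo "
    "(fetch_date timestamp, short_codes varchar, post_types varchar)"
)
demo_conn.execute("INSERT INTO tracking_demo VALUES ('03/11/21', 'CMP44ARBOnZ', 1137)")
demo_conn.execute("INSERT INTO static_demo VALUES ('03/11/21', 'CMP44ARBOnZ', 'Carousel')")

# Name each output column once instead of using SELECT *
q = """
    SELECT a.fetch_date      AS fetch_date,
           a.short_codes     AS short_codes,
           a.number_of_likes AS number_of_likes,
           b.post_types      AS post_types
    FROM tracking_demo a
    JOIN static_demo b
    ON a.short_codes = b.short_codes
    """
cursor = demo_conn.execute(q)
column_names = [col[0] for col in cursor.description]
result_rows = cursor.fetchall()
print(column_names)
print(result_rows)
```

The same query string can be passed to `pd.read_sql(q, conn)` as in the cells above.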


# Platform IV. HBO Max

In [15]:
hbomax_address = '../data/instagram_hbomax'

hbomax_static, hbomax_tracking = load_data(hbomax_address)

loading data......
hbomax_post_general_tracking_info.csv
hbomax_post_general_static_info.csv


In [16]:
## Create table

table_name_hbomax_static = 'instagram_hbomax_static'
table_name_hbomax_tracking = 'instagram_hbomax_tracking'

# ensure the db file exists (touch() only updates the modification time if it already does)
Path(db_name).touch()

### write static data

# initialize schema
c.execute('''CREATE TABLE IF NOT EXISTS {table_name}
                         (fetch_date timestamp,
                          hrefs varchar,
                          short_codes varchar,
                          post_types varchar,
                          captions varchar,
                          accessibility_caption varchar,
                          id int,
                          timestamp int,
                          upload_date timestamp,
                          upload_date_text varchar
                          )'''.format(table_name = table_name_hbomax_static))

# write csv to table
hbomax_static.to_sql(table_name_hbomax_static, 
                      conn, 
                      if_exists='append',
                      index=False)

### write tracking data

# initialize schema
c.execute('''CREATE TABLE IF NOT EXISTS {table_name}
                         (fetch_date timestamp,
                          short_codes varchar,
                          number_of_likes int,
                          number_of_video_views int,
                          number_of_comments int
                          )'''.format(table_name = table_name_hbomax_tracking))

# write csv to table
hbomax_tracking.to_sql(table_name_hbomax_tracking, 
                      conn, 
                      if_exists='append',
                      index=False)

In [17]:
### querying from the db
q = """
    SELECT * FROM instagram_hbomax_tracking a
    JOIN instagram_hbomax_static b
    ON a.short_codes = b.short_codes
    """
hbomax = pd.read_sql(q, conn)
hbomax.head(2)

Unnamed: 0,fetch_date,short_codes,number_of_likes,number_of_video_views,number_of_comments,fetch_date.1,hrefs,short_codes.1,post_types,captions,accessibility_caption,id,timestamp,upload_date,upload_date_text
0,03/11/21,CMK2rTdhOGy,4478,,64,03/11/21,https://www.instagram.com/p/CMK2rTdhOGy/,CMK2rTdhOGy,Carousel,"Photo by HBO Max on March 08, 2021. May be a c...","Photo by HBO Max on March 08, 2021. May be a c...",2525071011440026034,1615231943,2021-03-08T19:32:23.000Z,"Mar 8, 2021"
1,03/11/21,CMK2rTdhOGy,4478,,64,03/11/21,https://www.instagram.com/p/CMK2rTdhOGy/,CMK2rTdhOGy,Carousel,"Photo by HBO Max on March 08, 2021. May be a c...","Photo by HBO Max on March 08, 2021. May be a c...",2525071011440026034,1615231943,2021-03-08T19:32:23.000Z,"Mar 8, 2021"


# Platform Page Info

In [18]:
platform_info_address = '../data/instagram_platform_page_info'

platform_info = load_data(platform_info_address, kind = 'general')

loading data......
general_info_among_platforms.csv


In [19]:
### create table
table_name_platform_info = 'instagram_platform_page_info'
Path(db_name).touch()


# initialize schema
c.execute('''CREATE TABLE IF NOT EXISTS {table_name}
                         (fetch_date timestamp,
                          streaming_platform varchar,
                          hrefs varchar,
                          page_ids int,
                          page_bios varchar,
                          follower_count int,
                          following_count int,
                          post_number int
                          )'''.format(table_name = table_name_platform_info))

# write csv to table
platform_info.to_sql(table_name_platform_info, 
                      conn, 
                      if_exists='append',
                      index=False)

In [20]:
### querying from the db
q = """
    SELECT * FROM instagram_platform_page_info
    """
platform = pd.read_sql(q, conn)
platform.head(2)

Unnamed: 0,fetch_date,streaming_platform,hrefs,page_ids,page_bios,follower_count,following_count,post_number
0,03/09/21,netflix,https://www.instagram.com/netflix/,207587378,@strongblacklead every month is Black history ...,26577084,977,3613
1,03/09/21,disneyplus,https://www.instagram.com/disneyplus/,7522677467,#WomenBeyondLimits: Celebrate their inspiring ...,3933952,208,1220
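
The notebook leaves `conn` open. A minimal sketch of committing and closing a connection once the work is done, shown on an in-memory database with a hypothetical table `demo`; `contextlib.closing` is used because a `sqlite3` connection used directly as a context manager only commits or rolls back and does not close.

```python
import sqlite3
from contextlib import closing

with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute(
        "CREATE TABLE demo (streaming_platform varchar, follower_count int)"
    )
    demo_conn.execute("INSERT INTO demo VALUES ('netflix', 26577084)")
    demo_conn.commit()
    followers = demo_conn.execute(
        "SELECT follower_count FROM demo WHERE streaming_platform = 'netflix'"
    ).fetchone()[0]

print(followers)  # 26577084

# The connection is closed here; further use raises sqlite3.ProgrammingError
try:
    demo_conn.execute("SELECT 1")
    connection_closed = False
except sqlite3.ProgrammingError:
    connection_closed = True
print(connection_closed)  # True
```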
