## About Dataset

Reddit is a discussion website where users post images and text in subforums called subreddits and discuss the shared content in the comment sections. This dataset contains the comments submitted by Reddit users in May 2015.

I got my data from Kaggle. Unfortunately, this dataset is too big to run on Kaggle, so I had to download it.
> https://www.kaggle.com/reddit/reddit-comments-may-2015/notebooks

If you want this data in JSON format, you can download it from: https://files.pushshift.io/reddit/comments/
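If you take the JSON route, the file can be read one comment at a time instead of being loaded whole. The sketch below is not part of the original workflow. It assumes the dump is a bz2-compressed file with one JSON object per line, so check the format of the file you actually download; the helper name `iter_comments` is hypothetical.

```python
import bz2
import json


def iter_comments(path):
    """Yield one comment dict per non-empty line of a bz2-compressed JSON-lines file."""
    # Assumption: one JSON object per line, UTF-8 encoded, bz2 compressed.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because the function is a generator, only one line is held in memory at a time, which matters for a file of this size.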

## Accessing data from sqlite and cleaning it

I used this SQLite query to clean the dataset before exporting it to CSV, because the uncleaned text caused problems when importing the data:
```sqlite
create table reddit_2015_05 as
select 
rd.created_utc,
rd.ups,
rd.subreddit_id,
rd.link_id,
rd.name,
rd.score_hidden,
replace(
	replace(
		replace(
				replace(
					replace(
						replace(
							replace(rd.author_flair_text,'\','')
						,'*','')
					,'#','')
				, X'0A', ' ')
		,char(13),' ')
	,';','')
,'"','') as author_flair_text,
rd.subreddit,
rd.id,
rd.removal_reason,
rd.gilded,
rd.downs,
rd.archived,
rd.author,
rd.score,
rd.retrieved_on,
replace(
	replace(
		replace(
				replace(
					replace(
						replace(
							replace(rd.body,'\','')
						,'*','')
					,'#','')
				, X'0A', ' ')
		,char(13),' ')
	,';','')
,'"','') as body,
rd.distinguished,
rd.edited,
rd.controversiality,
rd.parent_id
from may2015 rd;
```
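The nested `replace()` calls do two things: they delete the characters `\`, `*`, `#`, `;` and `"`, and they turn line feeds and carriage returns into spaces. Presumably this is what keeps each comment on one line and keeps the `;` delimiter out of the text. The same substitutions can be sketched in Python; the function name `clean_text` is hypothetical and this is an illustration, not the code used for the export.

```python
# Characters the query deletes outright, and the ones it turns into a space.
_REMOVED = '\\*#;"'
_TO_SPACE = "\n\r"

# str.maketrans(from, to, delete) builds a single-pass translation table.
_CLEAN_TABLE = str.maketrans(_TO_SPACE, " " * len(_TO_SPACE), _REMOVED)


def clean_text(value):
    """Apply the same substitutions as the nested replace() calls; None stays None."""
    if value is None:
        # SQLite's replace() returns NULL for a NULL input.
        return None
    return value.translate(_CLEAN_TABLE)
```

Since every substitution works on a single character, the order of the replacements does not affect the result.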

## Splitting csv data to make it ready for import

I needed to split my CSV file before importing it into PostgreSQL, because on my Windows setup the PostgreSQL COPY command could not handle files bigger than 4 GB.

I used [csvsplitter](https://www.erdconcepts.com/dbtoolbox/csvsplitter/csvsplitter.zip) from [erdconcepts](https://www.erdconcepts.com/dbtoolbox.html).

I opened cmd and ran these commands:

```cmd
cd C:\data\reddit\csvsplitter

CSVSplitter.exe filename="C:\data\reddit\reddit_2015_05.csv" rowcount=5000000
```

It split my CSV into 11 files ranging from 1.2 GB to 1.5 GB.
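If csvsplitter is not an option, a row-based split can be approximated in Python. This is a sketch, not a description of how csvsplitter works internally. It assumes one record per line (embedded newlines were already replaced by the cleaning query) and no header row. The function name `split_by_rows` and the `-000` numbering are illustrative choices.

```python
from pathlib import Path


def split_by_rows(source, rows_per_file, encoding="utf-8"):
    """Write consecutive chunks of `rows_per_file` lines next to `source`.

    Returns the list of chunk paths in order.
    """
    source = Path(source)
    chunk_paths = []
    out = None
    try:
        # newline="" keeps the original line endings untouched.
        with source.open("r", encoding=encoding, newline="") as src:
            for line_number, line in enumerate(src):
                if line_number % rows_per_file == 0:
                    # Start a new chunk file every `rows_per_file` lines.
                    if out is not None:
                        out.close()
                    chunk_path = source.with_name(
                        f"{source.stem}-{len(chunk_paths):03d}{source.suffix}"
                    )
                    chunk_paths.append(chunk_path)
                    out = chunk_path.open("w", encoding=encoding, newline="")
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return chunk_paths
```

Calling `split_by_rows(r"C:\data\reddit\reddit_2015_05.csv", 5_000_000)` would mirror the row count used above.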

## Creating table in PostgreSQL to import our dataset

I created my PostgreSQL table with this query:

```PostgreSQL
CREATE TABLE "ODS"."EXT_REDDIT_COMMENTS"
(
    created_utc integer,
    ups integer,
    subreddit_id text COLLATE pg_catalog."default",
    link_id text COLLATE pg_catalog."default",
    name text COLLATE pg_catalog."default",
    score_hidden text COLLATE pg_catalog."default",
    author_flair_text text COLLATE pg_catalog."default",
    subreddit text COLLATE pg_catalog."default",
    id text COLLATE pg_catalog."default",
    removal_reason text COLLATE pg_catalog."default",
    gilded integer,
    downs integer,
    archived text COLLATE pg_catalog."default",
    author text COLLATE pg_catalog."default",
    score integer,
    retrieved_on integer,
    body text COLLATE pg_catalog."default",
    distinguished text COLLATE pg_catalog."default",
    edited text COLLATE pg_catalog."default",
    controversiality integer,
    parent_id text COLLATE pg_catalog."default"
)

TABLESPACE pg_default;

ALTER TABLE "ODS"."EXT_REDDIT_COMMENTS"
    OWNER to postgres;
```

## Importing dataset

Then I used the PostgreSQL COPY command to import my data:

```PostgreSQL
SET STATEMENT_TIMEOUT TO 3000000;

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-000.CSV' DELIMITER ';';

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-001.CSV' DELIMITER ';';

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-002.CSV' DELIMITER ';';

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-003.CSV' DELIMITER ';';

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-004.CSV' DELIMITER ';';

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-005.CSV' DELIMITER ';';

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-006.CSV' DELIMITER ';';

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-007.CSV' DELIMITER ';';

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-008.CSV' DELIMITER ';';

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-009.CSV' DELIMITER ';';

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-010.CSV' DELIMITER ';';

COMMIT;
```
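The eleven COPY statements differ only in the file number, so they can be generated instead of typed by hand. A small sketch: `build_copy_statements` is a hypothetical helper, and the table name, directory and file prefix are copied from the statements above. Plain string formatting is acceptable here only because every value is a constant you control.

```python
def build_copy_statements(table, directory, prefix, count, delimiter=";"):
    """Return one COPY statement per chunk file, numbered 000 .. count-1."""
    return [
        f"COPY {table} FROM '{directory}/{prefix}-{i:03d}.CSV' DELIMITER '{delimiter}';"
        for i in range(count)
    ]


statements = build_copy_statements(
    table='"ODS"."EXT_REDDIT_COMMENTS"',
    directory="C:/data/reddit",
    prefix="REDDIT_2015_05",
    count=11,
)
print("\n\n".join(statements))
```

The printed text can be pasted into a query window, or each statement can be passed to a database cursor in turn.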

```python
import pandas as pd
import psycopg2

conn_string = 'host={pghost} port={pgport} dbname={pgdatabase} user={pguser} password={pgpassword}'.format(
    pgdatabase='MEF-BDA-PROD', pguser='postgres', pgpassword='123', pghost='localhost', pgport='5432')
conn = psycopg2.connect(conn_string)


def check_if_table_exists(schema, table):
    """Return True if schema.table is listed in information_schema.tables."""
    # Pass the names as query parameters instead of formatting them into the SQL string.
    with conn.cursor() as cur:
        cur.execute(
            "select exists(select * from information_schema.tables "
            "where table_schema = %s and table_name = %s)",
            (schema, table),
        )
        return cur.fetchone()[0]


print(check_if_table_exists('EDW', 'DWH_REDDIT_COMMENTS'))

# One-off step (left commented out): copy a 200-row sample into the EDW schema.
# cur = conn.cursor()
# cur.execute('CREATE TABLE "EDW"."DWH_REDDIT_COMMENTS" AS SELECT * FROM "ODS"."EXT_REDDIT_COMMENTS" LIMIT 200;')
# conn.commit()
# cur.close()

# To read the full table instead, use schema='ODS', table='EXT_REDDIT_COMMENTS'.
sql_command = 'SELECT * FROM "{schema}"."{table}";'.format(schema='EDW', table='DWH_REDDIT_COMMENTS')

df = pd.read_sql(sql_command, conn)

conn.close()

# print(df.memory_usage(index=True).sum())

df
```

Output: `False`

| | created_utc | ups | subreddit_id | link_id | name | score_hidden | author_flair_text | subreddit | id | removal_reason | ... | downs | archived | author | score | retrieved_on | body | distinguished | edited | controversiality | parent_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1430655728 | 2 | t5_2qw2g | t3_34p2bd | t1_cqwvfop | 0 | | gamemaker | cqwvfop | | ... | 0 | 0 | ZeCatox | 2 | 1432744816 | Rank C ! That felt tough ! but maybe not so mu... | | 0 | 0 | t3_34p2bd |
| 1 | 1430655728 | 3 | t5_2xdca | t3_34okth | t1_cqwvfoq | 0 | Rias and Akeno &lt3 | HighschoolDxD | cqwvfoq | | ... | 0 | 0 | Mentalorder | 3 | 1432744816 | Yeah me too, but doesn't have to do with anime... | | 0 | 0 | t1_cqwv9tq |
| 2 | 1430655728 | 2 | t5_2qqjc | t3_34om4a | t1_cqwvfor | 0 | | todayilearned | cqwvfor | | ... | 0 | 0 | amornglor | 2 | 1432744816 | He was never parodied on Seinfeld, though. | | 0 | 0 | t1_cqwvf62 |
| 3 | 1430655728 | 2 | t5_2qh61 | t3_34oih0 | t1_cqwvfos | 0 | | WTF | cqwvfos | | ... | 0 | 0 | Koolaidwifebeater | 2 | 1432744816 | Pocket excitement! | | 0 | 0 | t1_cqwsum1 |
| 4 | 1430655728 | -2 | t5_31c2x | t3_34jbrk | t1_cqwvfot | 0 | | enoughsandersspam | cqwvfot | | ... | 0 | 0 | toomuchspamb | -2 | 1432744816 | https://www.opensecrets.org/politicians/summar... | | 0 | 1 | t1_cqwr50k |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | 1430655745 | 1 | t5_2rnei | t3_34ox0d | t1_cqwvfu2 | 0 | | Comcast | cqwvfu2 | | ... | 0 | 0 | firedfromcomcast | 1 | 1432744818 | What did I just read? | | 0 | 0 | t3_34ox0d |
| 196 | 1430655745 | 2 | t5_2rjli | t3_34nlc2 | t1_cqwvfu3 | 0 | 17 | teenagers | cqwvfu3 | | ... | 0 | 0 | prettyrandomusername | 2 | 1432744818 | Do something to make yourself proud. Take up a... | | 0 | 0 | t1_cqwnr6r |
| 197 | 1430655745 | 3 | t5_2u3tn | t3_34obyw | t1_cqwvfu4 | 0 | | SSBPM | cqwvfu4 | | ... | 0 | 0 | adelrune | 3 | 1432744818 | Are you using the wifi-safe version ? I found ... | | 0 | 0 | t3_34obyw |
| 198 | 1430655745 | 2 | t5_2sa8b | t3_34ll0x | t1_cqwvfu5 | 0 | | KitchenConfidential | cqwvfu5 | | ... | 0 | 0 | dirtjuggalo | 2 | 1432744818 | Sounds like a few of the places I've worked in... | | 0 | 0 | t1_cqwt4aw |


## Sources:

 1. [About Reddit](https://en.wikipedia.org/wiki/Reddit)

 2. [Data source](https://www.kaggle.com/reddit/reddit-comments-may-2015/notebooks)

 3. [Checking if a table exists with psycopg2 on PostgreSQL](https://stackoverflow.com/questions/1874113/checking-if-a-postgresql-table-exists-under-python-and-probably-psycopg2)