## About Dataset

Reddit is a discussion website where users post images and text in subforums called subreddits and discuss the shared content in the comment section. This dataset contains the comments submitted by Reddit users in May 2015 (05/2015), with 54.504.410 rows and 22 columns.

I got my data from Kaggle. Unfortunately, this dataset is too big to run on Kaggle, so I needed to download it.
> https://www.kaggle.com/reddit/reddit-comments-may-2015/notebooks

If you want this data in JSON format, you can download it from: https://files.pushshift.io/reddit/comments/

## Accessing data from sqlite and cleaning it

I used this SQLite query to clean the dataset before exporting it to CSV, because characters such as backslashes, quotes, semicolons, and line breaks caused problems when importing the data.

I didn't import the `author_flair_css_class` field because it is not important for our analysis.

```sqlite
create table reddit_2015_05 as
select 
rd.created_utc,
rd.ups,
rd.subreddit_id,
rd.link_id,
rd.name,
rd.score_hidden,
replace(
	replace(
		replace(
				replace(
					replace(
						replace(
							replace(rd.author_flair_text,'\','')
						,'*','')
					,'#','')
				, X'0A', ' ')
		,char(13),' ')
	,';','')
,'"','') as author_flair_text,
rd.subreddit,
rd.id,
rd.removal_reason,
rd.gilded,
rd.downs,
rd.archived,
rd.author,
rd.score,
rd.retrieved_on,
replace(
	replace(
		replace(
				replace(
					replace(
						replace(
							replace(rd.body,'\','')
						,'*','')
					,'#','')
				, X'0A', ' ')
		,char(13),' ')
	,';','')
,'"','') as body,
rd.distinguished,
rd.edited,
rd.controversiality,
rd.parent_id
from may2015 rd;
```
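The nested `replace` calls are easier to follow when written as a list of substitutions. Below is a minimal Python sketch of the same cleanup, for illustration only. The helper name `clean_text` is hypothetical and was not part of the original workflow.

```python
# Characters the SQLite query deletes outright.
REMOVED = ["\\", "*", "#", ";", '"']
# Line feed (X'0A') and carriage return (char(13)) each become one space.
TO_SPACE = ["\n", "\r"]


def clean_text(value):
    """Apply the same substitutions as the nested replace() calls."""
    if value is None:  # SQLite's replace() also yields NULL for NULL input
        return None
    for ch in REMOVED:
        value = value.replace(ch, "")
    for ch in TO_SPACE:
        value = value.replace(ch, " ")
    return value


print(clean_text('He said "hi";\r\n*bold* #1'))
```

Removing `;` matters most here because the export later uses `;` as the field delimiter.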

## Splitting csv data to make it ready for import

I needed to split my CSV file before importing it into PostgreSQL, because the PostgreSQL COPY command could not handle files bigger than 4GB in my setup.

I used [csvsplitter](https://www.erdconcepts.com/dbtoolbox/csvsplitter/csvsplitter.zip) from [erdconcepts](https://www.erdconcepts.com/dbtoolbox.html).

I opened cmd and ran these commands:

```cmd
cd C:\data\reddit\csvsplitter

CSVSplitter.exe filename="C:\data\reddit\reddit_2015_05.csv" rowcount=5000000
```

It split my CSV into 11 files ranging from 1.2GB to 1.5GB.
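If csvsplitter is not available, the same row-count split can be approximated in Python. This is a rough sketch and not the tool's actual behavior. It assumes one record per line (true here because the cleaning query replaced line breaks with spaces) and no header row. The function name `split_csv` and the `-000` style suffix are assumptions modeled on the file names used in the import step.

```python
import os
import tempfile


def split_csv(path, rows_per_file):
    """Split `path` into numbered part files of at most `rows_per_file` lines.

    Returns the list of part file paths in order.
    """
    base, ext = os.path.splitext(path)
    parts = []
    out = None
    with open(path, "r", encoding="utf-8", newline="") as src:
        for i, line in enumerate(src):
            if i % rows_per_file == 0:  # time to start the next part file
                if out is not None:
                    out.close()
                part_path = "{}-{:03d}{}".format(base, len(parts), ext)
                parts.append(part_path)
                out = open(part_path, "w", encoding="utf-8", newline="")
            out.write(line)
    if out is not None:
        out.close()
    return parts


# Small self-check on a temporary file: 7 rows, 3 rows per part.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.csv")
with open(demo_path, "w", encoding="utf-8", newline="") as f:
    for n in range(7):
        f.write("{};row {}\n".format(n, n))
demo_parts = split_csv(demo_path, 3)
print([os.path.basename(p) for p in demo_parts])
```

Reading line by line keeps memory use flat, which matters for a file of this size.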

## Creating table in PostgreSQL to import our dataset

I created my PostgreSQL table with this query:

```PostgreSQL
CREATE TABLE "ODS"."EXT_REDDIT_COMMENTS"
(
    created_utc integer,
    ups integer,
    subreddit_id text COLLATE pg_catalog."default",
    link_id text COLLATE pg_catalog."default",
    name text COLLATE pg_catalog."default",
    score_hidden text COLLATE pg_catalog."default",
    author_flair_text text COLLATE pg_catalog."default",
    subreddit text COLLATE pg_catalog."default",
    id text COLLATE pg_catalog."default",
    removal_reason text COLLATE pg_catalog."default",
    gilded integer,
    downs integer,
    archived text COLLATE pg_catalog."default",
    author text COLLATE pg_catalog."default",
    score integer,
    retrieved_on integer,
    body text COLLATE pg_catalog."default",
    distinguished text COLLATE pg_catalog."default",
    edited text COLLATE pg_catalog."default",
    controversiality integer,
    parent_id text COLLATE pg_catalog."default"
)

TABLESPACE pg_default;

ALTER TABLE "ODS"."EXT_REDDIT_COMMENTS"
    OWNER to postgres;
```

## Importing dataset

Then I used the PostgreSQL COPY command to import my data:

```PostgreSQL
SET STATEMENT_TIMEOUT TO 3000000;

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-000.CSV' DELIMITER ';';

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-001.CSV' DELIMITER ';';

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-002.CSV' DELIMITER ';';

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-003.CSV' DELIMITER ';';

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-004.CSV' DELIMITER ';';

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-005.CSV' DELIMITER ';';

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-006.CSV' DELIMITER ';';

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-007.CSV' DELIMITER ';';

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-008.CSV' DELIMITER ';';

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-009.CSV' DELIMITER ';';

COPY "ODS"."EXT_REDDIT_COMMENTS" FROM 'C:/data/reddit/REDDIT_2015_05-010.CSV' DELIMITER ';';

COMMIT;
```
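The eleven COPY statements differ only in the file number, so they can also be generated. The sketch below is illustrative. The helper name `copy_statements` is hypothetical, and plain string formatting is acceptable only because every value is a trusted constant.

```python
def copy_statements(table, directory, prefix, part_count, delimiter=";"):
    """Build one COPY statement per part file (numbered 000, 001, ...)."""
    statements = []
    for part in range(part_count):
        file_path = "{}/{}-{:03d}.CSV".format(directory, prefix, part)
        statements.append(
            "COPY {} FROM '{}' DELIMITER '{}';".format(table, file_path, delimiter)
        )
    return statements


stmts = copy_statements(
    '"ODS"."EXT_REDDIT_COMMENTS"', "C:/data/reddit", "REDDIT_2015_05", 11
)
print(stmts[0])
print(stmts[-1])
```

Each generated string can then be passed to a database cursor, or pasted into a SQL client.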

## Analyzing our data

The original dataset is too big to handle (54.504.410 rows, 33.3GB), so we should check whether we can reduce it without affecting our analysis.

```PostgreSQL
SELECT 
COUNT(*)                       
FROM "ODS"."EXT_REDDIT_COMMENTS" ERS
WHERE 1=1
AND LENGTH(ERS.BODY) > 2;
```
Requiring a body longer than two characters leaves 54.333.604 rows and removes comments like 'OK', which are not meaningful on their own.

```PostgreSQL
SELECT 
COUNT(*)                       
FROM "ODS"."EXT_REDDIT_COMMENTS" ERS
WHERE 1=1
AND LENGTH(ERS.BODY) > 2
AND (LOWER(ERS.AUTHOR) LIKE '%\_bot\_%'
OR LOWER(ERS.AUTHOR) LIKE '%\-bot\-%');
```
This query counts comments whose author name contains `-bot-` or `_bot_`. Removing them would drop only 958 bot comments, which is not a big decrease.

```PostgreSQL
SELECT
COUNT(*)
FROM "ODS"."EXT_REDDIT_COMMENTS" ERS
WHERE 1=1
AND LENGTH(ERS.BODY) > 2
AND NOT (LOWER(ERS.AUTHOR) LIKE '%\_bot\_%'
OR LOWER(ERS.AUTHOR) LIKE '%\-bot\-%')
AND NOT(LOWER(REPLACE(ERS.BODY,'''','')) LIKE '%im a bot%');
```
We can also filter out comments that contain the text "I'm a bot", which removes a further 24.918 rows.

```PostgreSQL
SELECT
COUNT(*)
FROM "ODS"."EXT_REDDIT_COMMENTS" ERS
WHERE 1=1
AND LENGTH(ERS.BODY) > 2
AND NOT (LOWER(ERS.AUTHOR) LIKE '%\_bot\_%'
OR LOWER(ERS.AUTHOR) LIKE '%\-bot\-%')
AND NOT(LOWER(REPLACE(ERS.BODY,'''','')) LIKE '%im a bot%')
AND ERS.BODY <> '[deleted]';
```
This query also removes deleted comments, which account for 3.138.587 rows.

```PostgreSQL
SELECT
COUNT(*)
FROM "ODS"."EXT_REDDIT_COMMENTS" ERS
WHERE 1=1
AND LENGTH(ERS.BODY) > 2
AND NOT (LOWER(ERS.AUTHOR) LIKE '%\_bot\_%'
OR LOWER(ERS.AUTHOR) LIKE '%\-bot\-%')
AND NOT(LOWER(REPLACE(ERS.BODY,'''','')) LIKE '%im a bot%')
AND ERS.BODY <> '[deleted]'
AND LENGTH(ERS.REMOVAL_REASON) = 0;
```
We should also exclude removed comments, whose original text has been replaced by a removal reason.

```PostgreSQL
SELECT
COUNT(*)
FROM "ODS"."EXT_REDDIT_COMMENTS" ERS
WHERE 1=1
AND LENGTH(ERS.BODY) > 2
AND NOT (LOWER(ERS.AUTHOR) LIKE '%\_bot\_%'
OR LOWER(ERS.AUTHOR) LIKE '%\-bot\-%')
AND NOT(LOWER(REPLACE(ERS.BODY,'''','')) LIKE '%im a bot%')
AND ERS.BODY <> '[deleted]'
AND LENGTH(ERS.removal_reason) = 0
AND ERS.BODY LIKE '% %';
```
We should remove single-word comments (1.885.966 rows) because they are not useful for our analysis.

```PostgreSQL
SELECT
COUNT(*)
FROM "ODS"."EXT_REDDIT_COMMENTS" ERS
WHERE 1=1
AND LENGTH(ERS.BODY) > 2
AND NOT (LOWER(ERS.AUTHOR) LIKE '%\_bot\_%'
OR LOWER(ERS.AUTHOR) LIKE '%\-bot\-%')
AND NOT(LOWER(REPLACE(ERS.BODY,'''','')) LIKE '%im a bot%')
AND ERS.BODY <> '[deleted]'
AND LENGTH(ERS.removal_reason) = 0
AND ERS.BODY LIKE '% %'
AND ERS.AUTHOR <> 'AutoModerator';
```
This query excludes the "AutoModerator" user, which subreddits use for moderation purposes. It filters out 286.444 rows.

```PostgreSQL
SELECT
COUNT(*)
FROM "ODS"."EXT_REDDIT_COMMENTS" ERS
WHERE 1=1
AND LENGTH(ERS.BODY) > 2
AND NOT(LOWER(ERS.AUTHOR) LIKE '%\_bot\_%' OR LOWER(ERS.AUTHOR) LIKE '%\-bot\-%')
AND NOT(LOWER(REPLACE(ERS.BODY,'''','')) LIKE '%im a bot%')
AND ERS.BODY <> '[deleted]'
AND LENGTH(ERS.removal_reason) = 0
AND ERS.BODY LIKE '% %'
AND ERS.AUTHOR <> 'AutoModerator'
AND ERS.AUTHOR <> '[deleted]';
```
Filtering out authors who deleted their accounts removes 305.983 rows.
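Taken together, the WHERE clauses above amount to a single keep-or-drop rule per comment. The Python sketch below restates that rule for illustration. The function name `keep_comment` is hypothetical, and it assumes all three values are non-null strings, as they are after the CSV import.

```python
def keep_comment(author, body, removal_reason):
    """Return True if a comment passes every filter from the queries above."""
    author_l = author.lower()
    if len(body) <= 2:                               # LENGTH(body) > 2
        return False
    if "_bot_" in author_l or "-bot-" in author_l:   # bot-like author names
        return False
    if "im a bot" in body.replace("'", "").lower():  # self-declared bots
        return False
    if body == "[deleted]":                          # deleted comments
        return False
    if len(removal_reason) != 0:                     # removed comments
        return False
    if " " not in body:                              # single-word comments
        return False
    if author in ("AutoModerator", "[deleted]"):     # moderation bot, deleted accounts
        return False
    return True


print(keep_comment("alice", "this is a normal comment", ""))
print(keep_comment("my_bot_1", "beep boop, hello there", ""))
```

Note that the bot-name check only catches names where "bot" sits between two underscores or two hyphens, so names like `examplebot` would still pass.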

## Cleaning data

The SQL analysis above showed which rows to ignore, so we clean the data before working on it.

```python
import pandas as pd
import numpy as np
import psycopg2
import time
import math

conn_string = 'host={pghost} port={pgport} dbname={pgdatabase} user={pguser} password={pgpassword}'.format(pgdatabase='MEF-BDA-PROD',pguser='postgres',pgpassword='123',pghost='localhost',pgport='5432')
conn = psycopg2.connect(conn_string)
cur = conn.cursor()

def check_if_table_exists(schema, table):
    # Pass values as query parameters instead of formatting them into the SQL string.
    cur.execute("select exists(select * from information_schema.tables where table_schema=%s AND table_name=%s)", (schema, table))
    return cur.fetchone()[0]

def check_if_index_exists(index):
    cur.execute("SELECT EXISTS(SELECT * FROM PG_CLASS WHERE relname = %s)", (index,))
    return cur.fetchone()[0]

if check_if_table_exists('EDW', 'DWH_REDDIT_COMMENTS'):
    print('Table already exists')
else:
    start_time = time.time()
    # The source timestamps are in UTC, so keep the session in UTC as well.
    cur.execute('set time zone UTC;')
    # Raw string so that the LIKE escapes (\_ and \-) reach PostgreSQL unchanged.
    cur.execute(r"""
    CREATE TABLE "EDW"."DWH_REDDIT_COMMENTS" AS 
    SELECT
    ROW_NUMBER() OVER (ORDER BY ERS.ID) AS ID,
    TO_TIMESTAMP(GREATEST(ERS.CREATED_UTC ,CAST(ERS.EDITED AS INTEGER))) AS DATE,
    ERS.SUBREDDIT,
    ERS.AUTHOR,
    ERS.AUTHOR_FLAIR_TEXT,
    ERS.SCORE,
    ERS.BODY AS COMMENT
    FROM "ODS"."EXT_REDDIT_COMMENTS" ERS
    WHERE 1=1
    AND LENGTH(ERS.BODY) > 2
    AND NOT(LOWER(ERS.AUTHOR) LIKE '%\_bot\_%' OR LOWER(ERS.AUTHOR) LIKE '%\-bot\-%')
    AND NOT(LOWER(REPLACE(ERS.BODY,'''','')) LIKE '%im a bot%')
    AND ERS.BODY <> '[deleted]'
    AND LENGTH(ERS.removal_reason) = 0
    AND ERS.BODY LIKE '% %'
    AND ERS.AUTHOR <> 'AutoModerator'
    AND ERS.AUTHOR <> '[deleted]';
    """)
    conn.commit()
    print("Table created in {execute_time} seconds".format(execute_time=math.trunc(time.time()-start_time)))

if check_if_index_exists('IDX_DWH_REDDIT_COMMENTS#01'):
    print('Index already exists')
else:
    start_time = time.time()
    cur.execute("""
    CREATE INDEX "IDX_DWH_REDDIT_COMMENTS#01" 
    ON "EDW"."DWH_REDDIT_COMMENTS" USING BTREE(
        "id" ASC NULLS LAST,
        "date" ASC NULLS LAST
    )
    TABLESPACE PG_DEFAULT;
    """)
    conn.commit()
    print("Index created in {execute_time} seconds".format(execute_time=math.trunc(time.time()-start_time)))
```

```
Table already exists
Index already exists
```


1. We filtered our data, converted the epoch date to a readable timestamp, and added a numeric id so we can process the data in batches.
    This reduced the table from 54.504.410 rows (33.3GB) to 48.690.746 rows (24.5GB), an 11% reduction in rows and a 27% reduction in size.

2. We added an index to speed up reads from the table.
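The two transformations in step 1 can be sketched in plain Python. This is an illustrative sketch, not the code that was run. The names `comment_date` and `id_batches` are hypothetical, and `edited` is assumed to be 0 for unedited comments and an epoch timestamp otherwise.

```python
from datetime import datetime, timezone


def comment_date(created_utc, edited):
    """Mirror TO_TIMESTAMP(GREATEST(created_utc, edited)) with the session in UTC.

    The later of the two epoch values wins, so an edited comment
    gets its edit time as its date.
    """
    return datetime.fromtimestamp(max(created_utc, edited), tz=timezone.utc)


def id_batches(first_id, last_id, batch_size):
    """Yield inclusive (start, end) id ranges for `ID BETWEEN start AND end`."""
    start = first_id
    while start <= last_id:
        end = min(start + batch_size - 1, last_id)
        yield start, end
        start = end + 1


print(comment_date(1430438400, 0))
print(list(id_batches(1, 2500, 1000)))
```

Because the ids come from `ROW_NUMBER()`, they are dense, so every batch except the last holds exactly `batch_size` rows.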

```python
import os
import re
import urllib
from urllib.request import urlopen
import fsspec
import xlrd
import xlsxwriter
from pandas import DataFrame

def download_file_if_not_exists(file_url,file_name):
    start_time = math.trunc(time.time())
    # Remove an empty file left behind by a failed download so it is fetched again.
    if(os.path.exists(file_name) and os.stat(file_name).st_size==0):
        os.remove(file_name)
    if(not(os.path.exists(file_name))):
        urllib.request.urlretrieve(file_url,file_name)
        with open(file_name, 'r+', errors='ignore', encoding="utf-8") as f:
            file_text = f.read()
            # Replace commas inside quoted fields with spaces, then drop backslashes and quotes.
            file_text = re.sub(r'"[^"]*"', lambda m: m.group(0).replace(',', ' '), file_text).replace('\\','').replace('"','').replace("'",'')
            f.seek(0)
            f.write(file_text)
            f.truncate()
    end_time = math.trunc(time.time())
    if(start_time!=end_time):
        print("File downloaded and cleaned in {execute_time} seconds".format(execute_time=end_time-start_time))

file_name = "DatafinitiElectronicsProductsPricingData.csv"
file_url = "https://query.data.world/s/n7byb65oqj47oro2btcqqyas62zclv"
download_file_if_not_exists(file_url,file_name)

# Average USD price per product, cached in an Excel file.
file_name = "electronic_products_pricing_df.xlsx"
if(os.path.exists(file_name)):
    electronic_products_pricing_df = pd.read_excel(file_name)
else:
    electronic_products_pricing_df = pd.read_csv("DatafinitiElectronicsProductsPricingData.csv", encoding="utf-8")
    electronic_products_pricing_df = electronic_products_pricing_df.loc[:, ~electronic_products_pricing_df.columns.str.contains('^Unnamed')]
    electronic_products_pricing_df = electronic_products_pricing_df[electronic_products_pricing_df["prices.currency"] == "USD"]
    electronic_products_pricing_df = electronic_products_pricing_df[["name","brand","categories","prices.amountMax"]]
    electronic_products_pricing_df = electronic_products_pricing_df.groupby(["name","brand","categories"]).mean()
    electronic_products_pricing_df = electronic_products_pricing_df.reset_index()
    electronic_products_pricing_df = electronic_products_pricing_df.rename(columns = {"prices.amountMax":"average_price"})
    electronic_products_pricing_df["keys"] =  electronic_products_pricing_df[["name","brand","categories"]].agg(' '.join, axis=1, ).str.lower()
    electronic_products_pricing_df.to_excel(file_name, engine='xlsxwriter')

# Highest average price per keyword, cached in an Excel file.
file_name = "keys_pricing_df.xlsx"
if(os.path.exists(file_name)):
    keys_pricing_df = pd.read_excel(file_name,index_col=0)
else:
    keys_pricing_df = DataFrame(electronic_products_pricing_df["keys"].replace('&','').str.split(' ').tolist(), index=electronic_products_pricing_df["average_price"]).stack()
    keys_pricing_df = keys_pricing_df.reset_index()
    keys_pricing_df.columns = ["average_price","level","key"]
    keys_pricing_df = keys_pricing_df[keys_pricing_df["key"] != "&"]
    keys_pricing_df = keys_pricing_df[["key","average_price"]]
    # regex=True is required in recent pandas versions, where str.replace defaults to literal matching.
    keys_pricing_df["key"] = keys_pricing_df["key"].str.replace(r'[^A-Za-z0-9]+','', regex=True)
    # Purely numeric keys become NaN here and are dropped by dropna() below.
    keys_pricing_df["key"] = keys_pricing_df["key"][~keys_pricing_df["key"].str.isnumeric()]
    # dropna() does not drop empty strings, so filter them out explicitly.
    keys_pricing_df = keys_pricing_df[keys_pricing_df["key"] != ""]
    keys_pricing_df = keys_pricing_df.dropna()
    keys_pricing_df = keys_pricing_df.groupby(["key"]).max()
    keys_pricing_df = keys_pricing_df.reset_index()
    keys_pricing_df = keys_pricing_df.rename(columns = {"average_price":"max_price"})
    keys_pricing_df.to_excel(file_name, engine='xlsxwriter')

keys_pricing_df
```

| | key | max_price |
|---|---|---|
| 0 | 000mah | 93.374000 |
| 1 | 0g03674 | 209.851250 |
| 2 | 1000watt | 998.326667 |
| 3 | 1000x | 191.996667 |
| 4 | 100400mm | 1574.278571 |
| ... | ... | ... |
| 2926 | zubehr | 58.730000 |
| 2927 | zubie | 99.990000 |
| 2928 | zuxbbweveteztffywrybecztecqtezfbc | 1799.970000 |
| 2929 | zxqyvbwuwyydcq | 18.158000 |


We load product price and keyword data to analyze how much we could gain by showing ads based on users' comments.
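The keyword table is built by splitting each product's text into tokens and keeping the highest average price seen for each token. Here is a dependency-free sketch of that idea, under stated assumptions: `max_price_per_key` is a hypothetical name, and the sample rows are made up rather than taken from the dataset.

```python
import re


def max_price_per_key(products):
    """Map each keyword to the highest average price among products using it.

    `products` is an iterable of (text, average_price) pairs, where `text`
    plays the role of the lower-cased "keys" column.
    """
    prices = {}
    for text, average_price in products:
        for token in text.lower().split(" "):
            key = re.sub(r"[^A-Za-z0-9]+", "", token)  # strip punctuation
            if key == "" or key.isnumeric():           # drop empty and purely numeric keys
                continue
            prices[key] = max(prices.get(key, average_price), average_price)
    return prices


sample = [
    ("acme usb-c cable 2m", 12.5),       # made-up rows for illustration
    ("acme 1000watt amplifier", 300.0),
]
print(max_price_per_key(sample))
```

Taking the maximum means a common word such as a brand name inherits the price of the most expensive product it appears in, which is worth keeping in mind when reading the table.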

## Removing stopwords and using Natural Language Processing while Detecting Language

Stopwords are commonly used words that should not be stored because they add no significant value to the analysis. We are going to remove them from our table so we can reduce our data size and process our data faster.

First we must test which libraries to use and how to use them:

```python
import pycld2

# detect() returns (is_reliable, bytes_found, details); details[0] is the language with the highest confidence score.
for value in ['Testing this value', 'Bu değeri deniyoruz', 'Test text']:
    print("String '{test_value}' returns: {most_confident_language_tuple}\n".format(test_value=value,most_confident_language_tuple=pycld2.detect(value)[2][0]))
```

```
String 'Testing this value' returns: ('ENGLISH', 'en', 95, 1077.0)

String 'Bu değeri deniyoruz' returns: ('TURKISH', 'tr', 95, 1706.0)

String 'Test text' returns: ('Unknown', 'un', 0, 0.0)
```



```python
import nltk
import pycld2
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)
stop = stopwords.words('english')

sql_command = 'SELECT * FROM "{schema}"."{table}" WHERE ID = 15;'.format(schema='EDW', table='DWH_REDDIT_COMMENTS')
df = pd.read_sql(sql_command,conn)
print("Original comment:\n{comment}\n".format(comment=df['comment'][0]))

# Lower-case the comment and drop every whitespace-separated stopword.
df['comment'] = df['comment'].apply(lambda x: ' '.join([word for word in x.lower().split() if word not in (stop)]))

print("Comment after removing stopwords:\n{comment}\n".format(comment=df['comment'][0]))

start_time = time.time()
# Values are passed as query parameters and converted to plain Python types first.
cur.execute("""
UPDATE "EDW"."DWH_REDDIT_COMMENTS"
SET "comment" = %(comment)s
WHERE "id" = %(id)s
""", {'comment': str(df['comment'][0]), 'id': int(df['id'][0])})

print("Updated record(s) in {execute_time} seconds\n".format(execute_time=math.trunc(time.time()-start_time)))

sql_command = 'SELECT * FROM "{schema}"."{table}" WHERE ID = 15;'.format(schema='EDW', table='DWH_REDDIT_COMMENTS')
df = pd.read_sql(sql_command,conn)
conn.rollback()
print("Comment in table:\n{comment}\n\n".format(comment=df['comment'][0]))

# Detect the language of the comment text itself, not of the Series' string representation.
best_match = pycld2.detect(df['comment'][0])[2][0]
df['comment_language_code'] = best_match[1]
df['comment_language'] = best_match[0].lower().replace('unknown','english')

df
# Since this is only a test, we do not commit: the rollback above discards the update. The same steps will be used on the full table once testing is complete.
```

```
Original comment:
fault think that. hey helps well. least talk it. shot dog. 10 years. home partying new years going shoot .40 so... well loading outside shot accidentally. right dog. 5-7 tequila shots though go sleep would wake would okay. well called vet shot could saved him. least pain. woke saw dead dog yard… cried hours. old dog still good years him. fault sitting little tiny mouse understandable man discharging firearm shitzu. feel bad.

Comment after removing stopwords:
fault think that. hey helps well. least talk it. shot dog. 10 years. home partying new years going shoot .40 so... well loading outside shot accidentally. right dog. 5-7 tequila shots though go sleep would wake would okay. well called vet shot could saved him. least pain. woke saw dead dog yard… cried hours. old dog still good years him. fault sitting little tiny mouse understandable man discharging firearm shitzu. feel bad.

Updated record(s) in 0 seconds

Comment in table:
fault think that. hey helps well. leas
```

| | id | date | subreddit | author | author_flair_text | score | comment | comment_language_code | comment_language |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 2015-05-01 00:00:00+00:00 | offmychest | Zekkystyle | | 14 | fault think that. hey helps well. least talk i... | en | english |


It reduces our data well and it still makes sense.
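The stopword step above can be tried without a database or the NLTK download. The sketch below uses a tiny hand-picked stopword set as a stand-in for `stopwords.words('english')`. Both `SAMPLE_STOPWORDS` and `remove_stopwords` are hypothetical names used only for illustration.

```python
# A tiny hand-picked stopword set so the example runs without downloads;
# the real list comes from nltk.corpus.stopwords.words('english').
SAMPLE_STOPWORDS = {"i", "the", "a", "is", "it", "to", "and", "was", "my", "not"}


def remove_stopwords(comment, stopwords=SAMPLE_STOPWORDS):
    """Lower-case the comment and drop every whitespace-separated stopword."""
    return " ".join(w for w in comment.lower().split() if w not in stopwords)


print(remove_stopwords("It was not my fault and I think that."))
print(remove_stopwords("Talk about it."))
```

Because the text is split on whitespace only, a stopword with punctuation attached (such as `it.`) is not removed, which is why tokens like "it." and "him." survive in the cleaned comment shown earlier.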

```python
import nltk
import pycld2
from nltk.corpus import stopwords

start_row = 28000000
row_per_loop = 29000000
end_row = row_per_loop

nltk.download('stopwords', quiet=True)
stop = stopwords.words('english')
sql_command = """SELECT * FROM "{schema}"."{table}" WHERE ID BETWEEN 29708843 AND 29708899;""".format(schema='EDW', table='DWH_REDDIT_COMMENTS', start_row=start_row, end_row=end_row)
df = pd.read_sql(sql_command,conn)

# Run the detector once per comment and keep the most confident match.
best_matches = df['comment'].apply(pycld2.detect).apply(lambda x: x[2][0])
df['comment_language_code'] = best_matches.apply(lambda x: x[1])
# Comments with an unknown language are treated as English.
df['comment_language'] = best_matches.apply(lambda x: x[0]).str.lower().replace('unknown','english')

df

#df[df['comment_language_code'] != 'en']
# Since this is only a test, nothing is written back to the database. The same steps will be used on the full table once testing is complete.
```

| | id | date | subreddit | author | author_flair_text | score | comment | comment_language_code | comment_language |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 29708843 | 2015-05-19 22:17:30+00:00 | Turkey | VoodooRush | Swaziland | 2 | Peki ülkeyi Atatürk'ü Samsun'a göndermeyi gere... | tr | turkish |
| 1 | 29708844 | 2015-05-19 22:17:30+00:00 | leagueoflegends | Gigglestomp123 | | 2 | How do you skin that which has no skin? | en | english |
| 2 | 29708845 | 2015-05-19 22:17:30+00:00 | badhistory | arminius_saw | I would totally bang Hitler (with a Browning HP) | 11 | ...do you mean Moratorium or has /r/badhistory... | en | english |
| 3 | 29708846 | 2015-05-19 22:17:30+00:00 | Mariners | Alakith | | 2 | i dont really get the hate for Willie. Hes a b... | en | english |
| 4 | 29708847 | 2015-05-19 22:17:30+00:00 | WTF | Wrinklestiltskin | | -1 | Too forced... | un | english |
| 5 | 29708848 | 2015-05-19 22:17:30+00:00 | hockey | chelsea-fan111 | WPGOldWhiteNHL | 67 | I would like to know why Manitoba was shown na... | en | english |
| 6 | 29708849 | 2015-05-19 22:17:30+00:00 | rupaulsdragrace | KVDH85 | Danny Devito as Tony Soprano as the Fonz as Mi... | 4 | so happy to see their love and togetherness co... | en | english |
| 7 | 29708850 | 2015-05-19 22:17:30+00:00 | hearthstone | Jbladez | | 3 | HAH your flair makes this the most hypocritica... | en | english |
| 8 | 29708851 | 2015-05-19 22:17:30+00:00 | minnesotatwins | ChiefakaJames | | 6 | Francisco Liriano pitching for the Pirates. He... | en | english |
| 9 | 29708852 | 2015-05-19 22:17:30+00:00 | touhou | Marv134 | We found our very own sun | 3 | And I'm gonna do this and that to you, and the... | en | english |
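The "unknown becomes english" fallback can be isolated into a small helper. The sketch below works on the `(name, code, percent, score)` tuples shown in the earlier pycld2 output, so it runs without pycld2 installed. The name `normalize_language` is hypothetical.

```python
def normalize_language(best_match):
    """Turn a top (name, code, percent, score) tuple into (code, language).

    The code is kept as reported, while the name 'Unknown' falls back
    to 'english', as in the cell above.
    """
    name, code = best_match[0], best_match[1]
    language = name.lower()
    if language == "unknown":
        language = "english"
    return code, language


print(normalize_language(("TURKISH", "tr", 95, 1706.0)))
print(normalize_language(("Unknown", "un", 0, 0.0)))
```

Short comments such as "Too forced..." come back as unknown, so the fallback keeps them in the English group while the `un` code still records that the detector was unsure.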


```python
cur.close()
conn.close()
```

## Sources:

 1. [About Reddit](https://en.wikipedia.org/wiki/Reddit)

 2. [Data source](https://www.kaggle.com/reddit/reddit-comments-may-2015/notebooks)

 3. [Checking if a table exists with psycopg2 on PostgreSQL](https://stackoverflow.com/questions/1874113/checking-if-a-postgresql-table-exists-under-python-and-probably-psycopg2)

 4. [Using current time in UTC as default value in PostgreSQL. This is important because dates in the data are in UTC](https://stackoverflow.com/questions/16609724/using-current-time-in-utc-as-default-value-in-postgresql)

 5. [Creating multicolumn index on PostgreSQL](https://www.postgresql.org/docs/9.6/indexes-multicolumn.html)

 6. [Checking if an index exists](https://stackoverflow.com/questions/45983169/checking-for-existence-of-index-in-postgresql)

 7. [How to execute start time and end time in python](https://www.codegrepper.com/code-examples/delphi/how+to+execute+from+start+time+to+end+time+in+python)

 8. [Truncating numbers in python](https://www.w3schools.com/python/ref_math_trunc.asp)

 9. [Removing stopwords](https://stackoverflow.com/questions/29523254/python-remove-stop-words-from-pandas-dataframe)

 10. [Prevent SQL Injection in Python](https://realpython.com/prevent-python-sql-injection/)

 11. [Preventing SQL injection resulted in errors, but it needed to be done; data type conversion is the key here](https://stackoverflow.com/questions/39564755/programmingerror-psycopg2-programmingerror-cant-adapt-type-numpy-ndarray)

 12. [About stopwords](https://www.geeksforgeeks.org/removing-stop-words-nltk-python/)

 13. [How to detect language](https://github.com/aboSamoor/pycld2)

 14. [Increasing timeout while installing new packages](https://stackoverflow.com/questions/43298872/how-to-solve-readtimeouterror-httpsconnectionpoolhost-pypi-python-org-port)

 15. [PyCld2 only works on Linux systems](https://www.lfd.uci.edu/~gohlke/pythonlibs/)

 16. [Replacing text to change unknown values to english](https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-column-of-a-pandas-dataframe)

 17. [Electronic Products and Pricing Data](https://data.world/datafiniti/electronic-products-and-pricing-data/workspace/file?filename=DatafinitiElectronicsProductsPricingData.csv)

 18. [Remove all commas between quotes](https://stackoverflow.com/questions/38336518/remove-all-commas-between-quotes)

 19. [Check if a file exists in a directory](https://stackoverflow.com/questions/28144529/how-to-check-if-file-already-exists-if-not-download-on-python)

 20. [Download files](https://stackabuse.com/download-files-with-python/)

 21. [Deleting empty files](https://stackoverflow.com/questions/48046729/delete-empty-files)

 22. [Removing unnamed columns](https://stackoverflow.com/questions/43983622/remove-unnamed-columns-in-pandas-dataframe)

 23. [Split (explode) pandas dataframe string entry to separate rows](https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows)

 24. [Dropping numeric values](https://stackoverflow.com/questions/48636170/dropping-numeric-rows-from-dataframe-in-python)