## Sampling after data cleaning

The cleaned tables are sampled to measure how query running time scales with database size.
<br>
The project requirements are:
<br>&emsp;&emsp;For each of the above queries, you may want to test whether your query gives correct results. Please perform the following tasks:
<br>&emsp;&emsp;a. Record the running time.
<br>&emsp;&emsp;b. Cut the size of the database that you used by half, re-run the queries, and record the new running time.
<br>&emsp;&emsp;c. Further cut the size of the database in b) by half, re-run the queries, and record the new running time.
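
Tasks a to c all come down to timing the same query against databases of different sizes. A minimal sketch of one way to record a running time from Python is shown here. The helper `time_call` and the stand-in workload are hypothetical and not part of this notebook; the real queries would be wrapped in a function and passed in instead.

```python
import time

def time_call(run_query, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs.

    `run_query` is any zero-argument callable (hypothetical placeholder
    for a function that executes one query).
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    # The minimum is less sensitive to background noise than the mean.
    return min(timings)

# Stand-in workload; replace with a function that runs the real query.
elapsed = time_call(lambda: sum(range(100_000)))
print('best of 3: {:.6f} s'.format(elapsed))
```

Running the same helper against the full, half, and quarter databases gives three comparable timings per query.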



In [2]:
import pandas as pd
import os

In [4]:
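def half_and_quarter_sampling(df_name, random_state=1024):
    """Write 50% and 25% row samples of one table and return the sampled keys."""
    df = pd.read_csv(os.path.join('csv_files', df_name))
    print('Start sampling {}...'.format(df_name))
    # Both samples are drawn from the full table with the same seed;
    # the quarter sample is not drawn from half_df explicitly.
    half_df = df.sample(frac=0.5, random_state=random_state)
    quarter_df = df.sample(frac=0.25, random_state=random_state)

    # Keys of the sampled rows, used later to filter the relation tables.
    half_keys = half_df['_key'].tolist()
    quarter_keys = quarter_df['_key'].tolist()

    # The target directories must already exist.
    half_df.to_csv(os.path.join('csv_files', 'half', 'half_' + df_name), index=False)
    quarter_df.to_csv(os.path.join('csv_files', 'quarter', 'quarter_' + df_name), index=False)

    return half_keys, quarter_keys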
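
Requirement c asks for the database from b) to be cut in half again, whereas the function above draws the quarter sample from the full table. A sketch of an alternative that makes the nesting explicit, by sampling the quarter from the half, could look like the following. The function name `nested_half_and_quarter` and the toy table are hypothetical. This is not the method used in this notebook, and it may select different rows than the samples recorded here.

```python
import pandas as pd

def nested_half_and_quarter(df, random_state=1024):
    """Return a 50% sample of df and a 50% sample of that sample."""
    half_df = df.sample(frac=0.5, random_state=random_state)
    # Sampling from half_df guarantees the quarter is a subset of the half.
    quarter_df = half_df.sample(frac=0.5, random_state=random_state)
    return half_df, quarter_df

# Toy table with 100 made-up keys, for illustration only.
toy = pd.DataFrame({'_key': ['k{}'.format(i) for i in range(100)]})
half_toy, quarter_toy = nested_half_and_quarter(toy)

# Every key in the quarter sample also appears in the half sample.
is_nested = set(quarter_toy['_key']) <= set(half_toy['_key'])
print(len(half_toy), len(quarter_toy), is_nested)
```

With a nested quarter, every row in the smallest database is also present in the two larger ones, which keeps the three timing runs comparable.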

In [5]:
os.makedirs('csv_files/half', exist_ok=True)
os.makedirs('csv_files/quarter', exist_ok=True)

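# Entity tables to sample.
filenames = ['article.csv', 'inproceedings.csv', 'incollection.csv', 'phdthesis.csv',
             'mastersthesis.csv', 'www.csv', 'proceedings.csv', 'book.csv']
# Sampled keys pooled across all entity tables (half and quarter samples).
HKs, QKs = [], []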

for fn in filenames:
    half_keys, quarter_keys = half_and_quarter_sampling(fn)
    HKs.extend(half_keys)
    QKs.extend(quarter_keys)


Start sampling article.csv...
Start sampling inproceedings.csv...
Start sampling incollection.csv...
Start sampling phdthesis.csv...
Start sampling mastersthesis.csv...
Start sampling www.csv...
Start sampling proceedings.csv...
Start sampling book.csv...


### Filter the authorship relation

In [7]:
R_author = pd.read_csv('csv_files/R_author.csv')

In [9]:
half_R_author = R_author[R_author['_key'].isin(HKs)]
quarter_R_author = R_author[R_author['_key'].isin(QKs)]
print(R_author.shape)
print(half_R_author.shape)
print(quarter_R_author.shape)

(25389959, 2)
(12694301, 2)
(6346685, 2)
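
As a quick sanity check, the row counts printed above can be turned into fractions of the full relation. The sketch below only reuses those three numbers. The fractions need not be exactly 0.5 and 0.25, presumably because the relation is filtered by sampled key and a key can own several rows.

```python
# Row counts copied from the shapes printed above.
full_rows, half_rows, quarter_rows = 25389959, 12694301, 6346685

half_ratio = half_rows / full_rows
quarter_ratio = quarter_rows / full_rows
print('half: {:.4f}, quarter: {:.4f}'.format(half_ratio, quarter_ratio))
```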


In [10]:
half_R_author.to_csv('csv_files/half/half_R_author.csv', index=False)
quarter_R_author.to_csv('csv_files/quarter/quarter_R_author.csv', index=False)

del R_author, half_R_author, quarter_R_author

### Filter the editorship relation

In [11]:
R_editor = pd.read_csv('csv_files/R_editor.csv')

In [12]:
half_R_editor = R_editor[R_editor['_key'].isin(HKs)]
quarter_R_editor = R_editor[R_editor['_key'].isin(QKs)]
print(R_editor.shape)
print(half_R_editor.shape)
print(quarter_R_editor.shape)

(141330, 2)
(70433, 2)
(34982, 2)


In [13]:
half_R_editor.to_csv('csv_files/half/half_R_editor.csv', index=False)
quarter_R_editor.to_csv('csv_files/quarter/quarter_R_editor.csv', index=False)

del R_editor, half_R_editor, quarter_R_editor
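
The authorship and editorship relations are filtered with the same pattern: keep only the rows whose `_key` belongs to a sampled entity. A minimal sketch of that pattern as a reusable helper is given below. The function `filter_relation_by_keys`, the toy relation, and its `name` column are hypothetical and only illustrate the idea.

```python
import pandas as pd

def filter_relation_by_keys(relation_df, keys, key_column='_key'):
    """Keep the rows of relation_df whose key is in `keys`."""
    # A set removes duplicate keys before the membership test.
    key_set = set(keys)
    return relation_df[relation_df[key_column].isin(key_set)]

# Toy relation: 'p1' has two rows, so filtering by key keeps both.
toy_relation = pd.DataFrame({
    '_key': ['p1', 'p1', 'p2', 'p3'],
    'name': ['A', 'B', 'C', 'D'],
})
toy_filtered = filter_relation_by_keys(toy_relation, ['p1', 'p3'])
print(toy_filtered.shape)
```

Because all rows of a sampled key are kept, the filtered relation stays consistent with the sampled entity tables.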