# Advanced Data Mining

## Data preprocessing

### Reducing dataset size

In [1]:
!pip install -r requirements.txt



In [2]:
import os.path

import pandas as pd

from config import DATASET_SIZE
from scripts.colors import bold, error, success, warning
from scripts.utils import checkpoint, setup



In [3]:
setup()

In [4]:
pickle_path = "data/filtered.pkl"

if not os.path.exists(pickle_path):
    # Load dataset from source
    warning("Local dataset not found")
    print("Reading dataset from Postgres... ", end="")
    conn = "postgresql://postgres:adm@db:5432/adm"
    query = "SELECT * FROM filtered"
    df = pd.read_sql(query, conn)
    success("OK")

    # Cache dataset on local machine
    print("Saving dataset to disk... ", end="")
    df.to_pickle(pickle_path)
    success("OK")
else:
    print("Dataset found, reading from disk... ", end="")
    # Load cached dataset
    df = pd.read_pickle(pickle_path)
    success("OK")

rows, cols = df.shape
print("Dataframe contains", bold(f"{rows} rows"), "and", bold(f"{cols} columns"))

Dataset found, reading from disk... OK
Dataframe contains 1612409 rows and 19 columns
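
The load-or-cache logic in the cell above can be factored into a small helper. This is only a sketch: `load_cached`, its `loader` parameter, and `fake_loader` are hypothetical names that do not appear in the notebook, and the stand-in loader replaces the `pd.read_sql` call so that the example runs without a database.

```python
import os
import tempfile

import pandas as pd


def load_cached(path, loader):
    """Return the pickled DataFrame at `path`; on a cache miss, build it with `loader()` and save it."""
    if os.path.exists(path):
        return pd.read_pickle(path)
    df = loader()
    df.to_pickle(path)
    return df


# Stand-in for the database query (hypothetical; records how often it is called)
calls = []


def fake_loader():
    calls.append(1)
    return pd.DataFrame({"subreddit": ["gaming"], "body": ["hello"]})


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "filtered.pkl")
    first = load_cached(cache, fake_loader)   # cache miss: runs fake_loader and writes the pickle
    second = load_cached(cache, fake_loader)  # cache hit: reads the pickle, loader is not called

print("loader calls:", len(calls))
```

Passing the loader as a callable keeps the expensive query from running unless the cache file is actually missing.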


Before doing anything else, remove rows that repeat the same `body` within the same subreddit.

In [5]:
rows_before = df.shape[0]
# Rows are duplicates only if both `body` and `subreddit` match; the first occurrence is kept
df = df.drop_duplicates(subset=['body', 'subreddit'])
removed = rows_before - df.shape[0]
print(removed, "row" if removed == 1 else "rows", "affected")

rows, cols = df.shape
print("Dataframe contains", bold(f"{rows} rows"), "and", bold(f"{cols} columns"))

7698 rows affected
Dataframe contains 1604711 rows and 19 columns
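
To illustrate what `subset=['body', 'subreddit']` does, here is a minimal sketch on a made-up frame (the toy rows are invented, not taken from the dataset). Identical text is dropped only when it repeats within the same subreddit; the same text in a different subreddit is kept.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "subreddit": ["gaming", "gaming", "science", "science"],
        "body": ["nice", "nice", "nice", "source?"],
    }
)

# Rows count as duplicates only if BOTH columns match; the first occurrence is kept
deduped = toy.drop_duplicates(subset=["body", "subreddit"])

removed = len(toy) - len(deduped)
print(removed, "row" if removed == 1 else "rows", "affected")
print(deduped)
```

Only the second `gaming` / `nice` row is removed; the `science` / `nice` row survives because its subreddit differs.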


Reduce the dataset to `DATASET_SIZE` samples (set in `config.py`).

In [6]:
print("Reducing dataset to", bold(f"{DATASET_SIZE} rows"))  # change in config.py

Reducing dataset to 20000 rows


In [7]:
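if DATASET_SIZE <= 0:
    error("Invalid dataset size")
    # Stop here: a non-positive size would make class_size meaningless below
    raise ValueError(f"DATASET_SIZE must be positive, got {DATASET_SIZE}")

# Number of distinct subreddits (classes) and the number of rows to keep per class
n_classes = len(df['subreddit'].unique())
class_size = DATASET_SIZE // n_classes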

Take `class_size` random rows from every class so that the dataset is balanced.

In [8]:
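class_samples = []

# Draw class_size random rows (without replacement) from each subreddit;
# sample() raises ValueError if a subreddit has fewer rows than class_size
for subreddit in df['subreddit'].unique():
    _df = df[df['subreddit'] == subreddit].sample(class_size)
    class_samples.append(_df)

# Stack the per-class samples and renumber the index from 0
df = pd.concat(class_samples).reset_index(drop=True)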

print(df['subreddit'].value_counts())
print()

rows, cols = df.shape
print("Dataframe contains", bold(f"{rows} rows"), "and", bold(f"{cols} columns"))

subreddit
gaming         4000
politics       4000
technology     4000
science        4000
programming    4000
Name: count, dtype: int64

Dataframe contains 20000 rows and 19 columns
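
The per-class loop can also be written with `groupby(...).sample(...)`, which is available in pandas 1.1 and later. The sketch below runs on a made-up, imbalanced frame; `target_size` is a hypothetical stand-in for `DATASET_SIZE`, and `random_state` is an addition that the notebook does not use, included here so that the draw is repeatable. It also shows that integer division can leave part of the requested size unused.

```python
import pandas as pd

# Toy, imbalanced data invented for illustration only
toy = pd.DataFrame(
    {
        "subreddit": ["gaming"] * 5 + ["science"] * 3 + ["politics"] * 4,
        "body": [f"comment {i}" for i in range(12)],
    }
)

target_size = 7  # hypothetical stand-in for DATASET_SIZE
n_classes = toy["subreddit"].nunique()
class_size = target_size // n_classes  # 7 // 3 == 2, so one row of the budget is unused

# Draw class_size rows from every subreddit without replacement
balanced = (
    toy.groupby("subreddit")
    .sample(n=class_size, random_state=0)
    .reset_index(drop=True)
)

print(balanced["subreddit"].value_counts())
```

As with the loop, sampling without replacement raises a `ValueError` when a class holds fewer than `class_size` rows, so the smallest class limits how large a balanced dataset can be.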


In [9]:
_ = checkpoint("01-balanced", dataframe=df)