# Post Here: Subreddit Suggestion

Using the top 100 posts from each of the top 1000 subreddits, we will train a model that suggests a suitable subreddit to post to, based on the text the user enters.

## Loading the data 

In [0]:
import os

# Load list of csv file names
csv_list = os.listdir('data')

In [0]:
import pandas as pd

# Put data from csv files into a DataFrame

name = []
titles = []
for x in csv_list:
    if x.endswith('.csv') and x[:-4]:              # Keep only csv files with a non-empty base name
        df = pd.read_csv(os.path.join('data', x))  # Temporary dataframe (df) for this subreddit's csv
        for title in df['title'][:100]:            # One row per post, limited to 100 posts per subreddit
            name.append(x[:-4])                    # File name without '.csv' is the subreddit name
            # Append the subreddit name to the title (intended to improve accuracy).
            # Note: this puts the label into the feature text, so the test score is
            # optimistic for real user input, which will not contain the subreddit name.
            titles.append(title + x[:-4])

# Build the DataFrame from the two lists
data = pd.DataFrame({'name': name, 'post_title': titles})

In [0]:
# Show data
data.head()

| | name | post_title |
|---|---|---|
| 0 | 1200isplenty | Weighing yourself after a few days of bad eati... |
| 1 | 1200isplenty | The holiday season truly is magic1200isplenty |
| 2 | 1200isplenty | I feel personally attacked right now1200isplenty |
| 3 | 1200isplenty | Some Wednesday motivation!1200isplenty |
| 4 | 1200isplenty | Oh no my diet plan has been revealed1200isplenty |
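
Because the title and the subreddit name are joined without a separator, the last word of a title can fuse with the name into a single token. The sketch below applies the default `CountVectorizer` token pattern, `(?u)\b\w\w+\b`, with plain `re` to two of the titles above. It only approximates the vectorizer (lowercasing is included, stop-word removal is not), and the `tokenize` helper is a hypothetical name used for illustration.

```python
import re

# Default CountVectorizer token pattern: runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Approximate CountVectorizer tokenization: lowercase, then apply the pattern."""
    return TOKEN_PATTERN.findall(text.lower())

# Title ends in a letter: the last word and the subreddit name fuse into one token
fused = tokenize("The holiday season truly is magic" + "1200isplenty")

# Title ends in punctuation: the subreddit name survives as its own token
separate = tokenize("Some Wednesday motivation!" + "1200isplenty")

print(fused[-1])     # magic1200isplenty
print(separate[-1])  # 1200isplenty
```

In the second case the subreddit name becomes a feature on its own, which is worth keeping in mind when reading the accuracy figure later on.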


## Split data

In [0]:
X = data['post_title']
y = data['name']

In [0]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [0]:
# Show shapes of each set
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((74402,), (24801,), (74402,), (24801,))
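
The shapes follow from the 25% test fraction. As a quick check, assuming `train_test_split` rounds the test count up and gives the remainder to the training set (our understanding of its behaviour for a float `test_size`), the arithmetic works out as follows:

```python
import math

n_samples = 74402 + 24801   # total rows, taken from the shapes shown above
test_size = 0.25

n_test = math.ceil(test_size * n_samples)  # assumed rounding: test count rounded up
n_train = n_samples - n_test               # remainder goes to the training set

print(n_samples, n_train, n_test)
```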

## Build Model

In [0]:
# Imports
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Bag-of-words: token counts per title, English stop words removed, unigrams only
bow_vector = CountVectorizer(stop_words="english", ngram_range=(1, 1))

# TF-IDF alternative to the count vectorizer (defined here but not used in the pipeline below)
tfidf_vector = TfidfVectorizer()

# The ML algorithm used
classifier = LogisticRegression()

# Pipeline: vectorize the text, then classify
pipe = Pipeline([('vectorizer', bow_vector),
                 ('classifier', classifier)])

# Train
pipe.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words='english', strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='warn', n_jobs=None,
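
The pipeline above uses the count vectorizer, while a TF-IDF vectorizer is also created. A minimal sketch of how TF-IDF could be swapped into the same pipeline shape, run on a few made-up titles: the toy data and the `tfidf_pipe` name are assumptions for illustration, and `max_iter=1000` is only there to avoid convergence warnings. Whether TF-IDF helps on the real data is not something this sketch shows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in data (hypothetical titles, not taken from the real dataset)
toy_titles = [
    "my new keyboard with blue switches",
    "custom keycaps for my keyboard build",
    "low calorie dinner under 400 calories",
    "meal prep ideas for a 1200 calorie day",
]
toy_labels = ["MechanicalKeyboards", "MechanicalKeyboards",
              "1200isplenty", "1200isplenty"]

# Same two-step structure as before, with TF-IDF weights instead of raw counts
tfidf_pipe = Pipeline([
    ("vectorizer", TfidfVectorizer(stop_words="english")),
    ("classifier", LogisticRegression(max_iter=1000)),
])
tfidf_pipe.fit(toy_titles, toy_labels)

toy_prediction = tfidf_pipe.predict(["look at this keyboard"])
print(toy_prediction)
```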
        

## Model Accuracy

In [0]:
from sklearn import metrics

# Predict on the held-out test set
predicted = pipe.predict(X_test)

# Model accuracy: fraction of test titles assigned to the correct subreddit
print("Logistic Regression Accuracy:", metrics.accuracy_score(y_test, predicted))

Logistic Regression Accuracy: 0.5635256642877303


In [0]:
# If the title of the user's post includes 'keyboard', the model suggests MechanicalKeyboards. Looks like it is working!
pipe.predict(['this is my new keyboard!'])

array(['MechanicalKeyboards'], dtype=object)
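
Since the goal is to suggest subreddits, it may be useful to return several candidates rather than only the single best class. One possible approach, sketched here on made-up titles, ranks the classes by `predict_proba`. The `suggest_top_k` helper, the toy data, and the `books` label are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def suggest_top_k(fitted_pipe, text, k=3):
    """Return the k most probable class labels for one text, best first."""
    probabilities = fitted_pipe.predict_proba([text])[0]
    best = np.argsort(probabilities)[::-1][:k]   # indices sorted by descending probability
    return [str(fitted_pipe.classes_[i]) for i in best]

# Toy stand-in data (hypothetical titles and a hypothetical 'books' label)
toy_titles = [
    "my new keyboard with blue switches",
    "custom keycaps for my keyboard build",
    "low calorie dinner under 400 calories",
    "meal prep ideas for a 1200 calorie day",
    "best fantasy novel of the year",
    "looking for a good mystery novel",
]
toy_labels = ["MechanicalKeyboards", "MechanicalKeyboards",
              "1200isplenty", "1200isplenty",
              "books", "books"]

toy_pipe = Pipeline([
    ("vectorizer", CountVectorizer(stop_words="english")),
    ("classifier", LogisticRegression(max_iter=1000)),
])
toy_pipe.fit(toy_titles, toy_labels)

suggestions = suggest_top_k(toy_pipe, "this is my new keyboard!", k=2)
print(suggestions)
```

With the real `pipe`, the same helper could be called as `suggest_top_k(pipe, user_text)`, since a fitted `LogisticRegression` exposes `predict_proba`.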

In [0]:
# Save the model to a pickle file in the same directory as the notebook
import pickle

model_file_name = 'nlp_model.plk'
with open(model_file_name, 'wb') as model_pkl:   # The file is closed automatically on exit
    pickle.dump(pipe, model_pkl)
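
To use the saved model later, it can be read back with `pickle.load`. A minimal round-trip sketch, using a plain dict as a stand-in for the fitted `pipe` and a temporary directory so that nothing is written next to the notebook (both are assumptions for illustration). With the real model, the loaded object would offer `predict` just like `pipe`. Only unpickle files from a trusted source, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline (assumption: any picklable object round-trips the same way)
stand_in_model = {"vectorizer": "CountVectorizer", "classifier": "LogisticRegression"}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "nlp_model.plk")

    # Save: the with-block closes the file even if dump raises
    with open(model_path, "wb") as model_file:
        pickle.dump(stand_in_model, model_file)

    # Load the object back from disk
    with open(model_path, "rb") as model_file:
        loaded_model = pickle.load(model_file)

print(loaded_model == stand_in_model)  # True
```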