Reddit Flair Detector

A web application to detect the flairs of r/India posts. The application is live here.

Prerequisites, Dependencies and Execution

Ensure that Python 3 and pip are installed on your system.

Open a terminal and clone the repository.

git clone https://github.com/ramanshgrover/redd-it-flair-detector/

Create and activate a virtual environment.

python -m venv .env
source .env/bin/activate

Download the spaCy and NLTK dependencies. First, run the following command in your terminal.

python -m spacy download en_core_web_sm

Then enter the Python console with python3, run the following, and exit the shell.

>>> import nltk
>>> nltk.download('stopwords')

Now make sure you are in the same directory as requirements.txt, and run the following command in your terminal.

pip3 install -r requirements.txt

Disclaimer: This may take a while.

Finally, run python main.py. Flask will start a local development server and print its address (host and port) to the terminal. Open that address in a web browser to use the app.
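
For orientation, main.py is a standard Flask entry point. The snippet below is only a generic sketch of such an entry point, not the repository's actual code; the route and response are placeholders.

```python
# Generic illustration of a Flask entry point like main.py (placeholder route
# and response; the real app serves the flair-prediction UI and model).
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Reddit Flair Detector"

if __name__ == "__main__":
    # Starts the development server and prints the local address
    # (typically http://127.0.0.1:5000/) to the terminal.
    app.run(debug=True)
```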

Summary of the Methodology

I sampled r/India flair-wise (200 posts per flair, within PRAW's constraints). The resulting dataset contained Hindi, English, and code-mixed (Hinglish) text. For each scraped data point (URL + title + post + comments), I separated out the Hindi and English portions. A key property of such text is that the English and Hindi parts generally occur in contiguous groups, which makes it feasible to isolate them.
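
The flair-wise sampling step can be sketched roughly as follows with PRAW, assuming script-type Reddit API credentials; the flair names, credentials, and field selection are illustrative placeholders rather than the repository's exact scraper.

```python
# Sketch: sample up to 200 r/India submissions per flair with PRAW.
# CLIENT_ID/CLIENT_SECRET and the flair list are placeholders.
import praw

reddit = praw.Reddit(client_id="CLIENT_ID",
                     client_secret="CLIENT_SECRET",
                     user_agent="flair-detector-sampler")

flairs = ["Politics", "AskIndia", "Non-Political"]  # example flairs only
samples = {}

for flair in flairs:
    results = reddit.subreddit("india").search(f'flair_name:"{flair}"', limit=200)
    samples[flair] = [
        {
            "title": post.title,
            "url": post.url,
            "body": post.selftext,
            # Top-level comments also feed the text used for classification.
            "comments": [c.body for c in post.comments if hasattr(c, "body")],
        }
        for post in results
    ]
```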

Using a corpus generated from an English dictionary, I computed each word's minimum Levenshtein distance to corpus words that share its first letter and whose length lies in the range (l-2, l+2), where l is the length of the word under consideration. For most scraped code-mixed tokens, depending on context, this distance to the corpus is large. I therefore assigned each word its distance and applied the k-means algorithm to obtain two clusters: code-mixed Hinglish and English. Using the googletrans library, I then converted the code-mixed Hindi written in Latin script into Devanagari script.
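
The sketch below compresses this distance-plus-clustering idea, assuming an English word list is available as the set english_words; the helper names are illustrative, not the notebook's.

```python
# Sketch: score each token by its minimum Levenshtein distance to dictionary
# words of similar length (and same first letter), then k-means the scores
# into two clusters: English-like vs. code-mixed Hinglish.
import numpy as np
from sklearn.cluster import KMeans

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def token_score(token, english_words):
    """Minimum distance to dictionary words sharing the token's first letter
    and having a length within +/-2 of the token's length."""
    l = len(token)
    candidates = [w for w in english_words
                  if w[:1] == token[:1] and l - 2 <= len(w) <= l + 2]
    return min((levenshtein(token, w) for w in candidates), default=l)

def split_tokens(tokens, english_words):
    """Cluster tokens into (english, hinglish) by their distance scores."""
    scores = np.array([[token_score(t, english_words)] for t in tokens], float)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
    # The cluster with the lower mean distance is treated as English.
    english_label = int(np.argmin([scores[labels == k].mean() for k in (0, 1)]))
    english = [t for t, lab in zip(tokens, labels) if lab == english_label]
    hinglish = [t for t, lab in zip(tokens, labels) if lab != english_label]
    return english, hinglish
```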

After further EDA, I preprocessed each flair class so that its text distribution is independent of the others, and found that TF-IDF word vectors performed best with an SVC (C=10, gamma=1, RBF kernel), achieving ~80.51% test accuracy, albeit with a surprisingly large degree of overfitting. For detailed analysis, intuitions, observations, and workings, do check out my Jupyter Notebooks.

Test accuracy (%) for each vectorizer and classifier combination:

| Vectorizer | Multinomial Naive Bayes | Logistic Regression | Random Forest | Support Vector Machine |
| --- | --- | --- | --- | --- |
| Bag-of-Words | 48.84 | 72.82 | 78.84 | 77.69 |
| TF-IDF | 57.17 | 61.66 | 79.74 | 80.51 |
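
The best-scoring configuration from the table can be reproduced with a pipeline along these lines, assuming the cleaned text and flair labels are available as the lists texts and labels (placeholder names):

```python
# Sketch of the best configuration above: TF-IDF features into an RBF-kernel
# SVC with C=10 and gamma=1. `texts` and `labels` are placeholder variables
# holding the cleaned post text and corresponding flairs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", SVC(C=10, gamma=1, kernel="rbf")),
])

model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```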
