Portland Data Science Natural Language Processing

Description

Work on Natural Language Processing at the Portland Data Science Meetup

System Requirements

Python, pandas, NumPy, Requests, Beautiful Soup (bs4), NLTK, langdetect, lxml

Files

Main

isitenglish.py

Input files

boardgame-frequent-user-comments.csv

Output files

english_comments.csv, non_english_comments.csv, google_english_comments.csv, google_non_english_comments.csv

Implementation

This Python script looks at board game review comments and attempts to determine whether each comment is English or not. To do this, it scrapes the 100 most common words in English from a Wikipedia page, then breaks the board game comments into individual words. If a comment contains at least one of the 100 most common words, it is deemed English.
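The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in isitenglish.py: the names `COMMON_WORDS`, `tokenize`, and `is_english` are hypothetical, the word set is a small hand-typed stand-in for the scraped list, and a regular expression stands in for whatever tokenizer the script actually uses.

```python
import re

# Hypothetical stand-in for the 100 words scraped from Wikipedia;
# only a handful are listed to keep the sketch self-contained.
COMMON_WORDS = {"the", "be", "to", "of", "and", "a", "in", "that", "have", "i"}


def tokenize(comment):
    """Lowercase a comment and split it into alphabetic words."""
    return re.findall(r"[a-z']+", comment.lower())


def is_english(comment, common_words=COMMON_WORDS):
    """Return True if any word in the comment is in the common-word set."""
    return any(word in common_words for word in tokenize(comment))


if __name__ == "__main__":
    for text in ("This is the best game", "Sehr gutes Spiel", "1"):
        print(repr(text), is_english(text))
```

Using a set for the word list keeps each membership test cheap, which matters when every word of every comment is checked.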

This algorithm is limited: it produces a false positive whenever one of the 100 most common English words also exists in another language. It also fails when a comment is too short (a common challenge), because a short comment may not contain any of the common words. The code warns the user when a comment is too short (fewer than 4 words) to process.
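Both limitations can be seen in a small sketch. The names `MIN_WORDS`, `words`, and `classify` are hypothetical, the word set is again a hand-typed subset, and returning a separate "too short" label is an assumption: the description above only says that the script warns about such comments.

```python
import re

COMMON_WORDS = {"the", "to", "of", "and", "a", "in"}  # hypothetical subset
MIN_WORDS = 4  # comments with fewer than 4 words are treated as too short


def words(comment):
    """Lowercase a comment and split it into alphabetic words."""
    return re.findall(r"[a-z']+", comment.lower())


def classify(comment):
    """Return 'too short', 'english', or 'non-english' for one comment."""
    tokens = words(comment)
    if len(tokens) < MIN_WORDS:
        # Assumed behavior: warn and skip the common-word check.
        print(f"Warning: comment too short to process: {comment!r}")
        return "too short"
    if any(token in COMMON_WORDS for token in tokens):
        return "english"
    return "non-english"


if __name__ == "__main__":
    print(classify("Great game"))                # too short
    print(classify("Ein Spiel in zwei Teilen"))  # false positive via "in"
```

The German comment is labeled English only because "in" is spelled the same in both languages, which is the false-positive case described above.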

Further, I loaded langdetect, a public port of one of Google's language detection algorithms, to compare my results with its results. That algorithm also struggled with short comments.
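A comparison run with langdetect might look like the sketch below. It assumes the langdetect package is installed; the wrapper name `detect_is_english` and the choice to return `None` for undetectable comments are illustrative and not taken from isitenglish.py.

```python
def detect_is_english(comment):
    """Return True/False from langdetect, or None if detection fails."""
    # Imported here so the sketch can be loaded without langdetect installed.
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    # langdetect is non-deterministic by default; a fixed seed makes
    # repeated runs give the same answer.
    DetectorFactory.seed = 0
    try:
        return detect(comment) == "en"
    except LangDetectException:
        # Raised when the text has no usable features, e.g. only digits.
        return None
```

Fixing the seed matters most for short comments, where langdetect's guess can otherwise change between runs.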

Results are written to 4 CSV files: English comments and non-English comments (my algorithm), and Google English comments and Google non-English comments (the Google-based detector).
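Splitting the comments into the output files could be sketched like this. The helper `write_split`, the `prefix` parameter, and the `comment` column name are assumptions for illustration; the sketch uses the standard-library `csv` module to stay dependency-free, whereas the script itself lists pandas as a requirement.

```python
import csv
from pathlib import Path


def write_split(comments, is_english, out_dir, prefix=""):
    """Write comments to <prefix>english_comments.csv and
    <prefix>non_english_comments.csv using the is_english predicate."""
    out_dir = Path(out_dir)
    english = [c for c in comments if is_english(c)]
    other = [c for c in comments if not is_english(c)]
    paths = {}
    for name, rows in (("english_comments.csv", english),
                       ("non_english_comments.csv", other)):
        path = out_dir / f"{prefix}{name}"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["comment"])  # assumed column name
            writer.writerows([row] for row in rows)
        paths[name] = path
    return paths
```

Calling it once with an empty prefix and once with `prefix="google_"`, each time with the matching predicate, would produce the four file names listed under Output files.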

The screenshots Results_EnglishComments.png and Results_NonEnglishComments.png show a snippet of the table for the comments determined to be English and for the comments determined to be non-English. As they show, some of the shorter English comments were determined to be non-English. Additionally, some comments are just the number 1 (users rate a game a score of 1 and comment "1"), and these are determined to be non-English.
