
Natural_Language_Processing

The use of natural language processing has exploded over the last decade. Applications that require machines to understand natural human speech patterns are abundant, and substantial improvements in these systems have increased their utility. Within the educational space, NLP is used to interpret human speech for the purpose of understanding human problems, and recently an online tutor passed a limited version of the Turing Test when it was indistinguishable from the teaching assistants in a college class.

In this project, I completed three main tasks: processing a set of documents, running a sentiment analysis of those documents, and then performing topic modelling on them. The documents were student notes from the class HUDK 4050 (Education Data Mining) taken last semester.

Datasets

Class-notes CSV files

The files are class notes from HUDK 4050. The variables used were Title and Notes: Title indicates the topic, and Notes contains the content.

week-list.csv

The variables are Title and week; week indicates in which week the topic was covered.

Packages Required

install.packages("tm")
install.packages("SnowballC")
install.packages("wordcloud")
install.packages("topicmodels")

Procedures

Making a Word Cloud

  1. Import all the document files and the list-of-weeks file
  2. Clean the HTML tags from the text
  3. Process the text using the tm package:
  • Convert the data frame to the corpus format used by the tm package
  • Remove extra whitespace, pre-defined stop words ('the', 'a', etc.), numbers and punctuation
  • Convert upper case to lower case and words to their stems for analysis
  • Convert the corpus to a term-document matrix so that each word can be analyzed individually
  4. Find common words by creating a data frame of word counts
  5. Generate a word cloud by setting the minimum frequency, scale, maximum number of words and the proportion of words drawn with 90-degree rotation (vertical words)
  6. Merge with the week list so that each entry has a variable representing its week
  7. Create a term-document matrix and repeat step 5 (a code sketch of this pipeline follows the list)
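
A minimal sketch of this pipeline, assuming the notes are read into a data frame D1 with Title and Notes columns; the file names and the word-cloud parameters are illustrative, not the exact values used in the project.

library(tm)
library(SnowballC)
library(wordcloud)

# Step 1: import the document file and the list of weeks
D1 <- read.csv("class-notes.csv", stringsAsFactors = FALSE)
week.list <- read.csv("week-list.csv", stringsAsFactors = FALSE)

# Step 2: clean the HTML tags from the text
D1$Notes <- gsub("<.*?>", "", D1$Notes)

# Step 3: convert to a tm corpus and process the text
corpus <- VCorpus(VectorSource(D1$Notes))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)
tdm <- TermDocumentMatrix(corpus)

# Step 4: data frame of word counts across all notes
counts <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
word.count <- data.frame(word = names(counts), freq = counts)

# Step 5: word cloud with minimum frequency, scale, maximum number of
# words and the proportion of words rotated 90 degrees
wordcloud(word.count$word, word.count$freq,
          min.freq = 10, scale = c(4, 0.5),
          max.words = 100, rot.per = 0.3)

Steps 6 and 7 then merge D1 with the week list by Title and rebuild the term-document matrix within each week before drawing the word cloud again.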

Sentiment Analysis

  1. Match the words in the corpus to lexicons of positive and negative words
  2. Generate an overall positive-negative score for each note from the matches between its words and the two lexicons
  3. Generate a visualization of the sum of the sentiment scores over the weeks with ggplot (see the sketch below)
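
A sketch of the sentiment scoring, continuing from the D1 data frame and week list above; it assumes the positive and negative lexicons are plain text word lists (the file names positive-words.txt and negative-words.txt are illustrative).

library(ggplot2)

# Lexicons of positive and negative words, one word per line
positive <- readLines("positive-words.txt")
negative <- readLines("negative-words.txt")

# Overall pos-neg score for one note: positive matches minus negative matches
score.note <- function(words) sum(words %in% positive) - sum(words %in% negative)

# Score each note, attach the week, and sum the scores per week
D1$score <- sapply(strsplit(tolower(D1$Notes), "\\s+"), score.note)
D2 <- merge(D1, week.list, by = "Title")
weekly <- aggregate(score ~ week, data = D2, sum)

# Visualize the weekly sum of sentiment scores
ggplot(weekly, aes(x = week, y = score)) +
  geom_col() +
  labs(x = "Week", y = "Sum of sentiment score")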

LDA Topic Modelling

  1. Weight the document-term matrix by term frequency-inverse document frequency (TF-IDF)
  2. Remove very uncommon terms (TF-IDF weight < 0.1)
  3. Find the sum of words in each document and remove documents with no non-zero entries (empty rows)
  4. Fit the LDA model, then find the most common terms in each topic and which documents belong to which topic
  5. Report the sentiment for each week together with one important topic for that week (see the sketch below)
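
A sketch of the topic-modelling steps with the topicmodels package, continuing from the processed corpus above; the number of topics k and the seed are illustrative.

library(tm)
library(topicmodels)

# Step 1: document-term matrices, one TF-IDF weighted and one with raw counts
dtm.tfi <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
dtm <- DocumentTermMatrix(corpus)

# Step 2: remove very uncommon terms (maximum TF-IDF weight below 0.1)
m.tfi <- as.matrix(dtm.tfi)
common.terms <- colnames(m.tfi)[apply(m.tfi, 2, max) >= 0.1]
dtm <- dtm[, which(colnames(dtm) %in% common.terms)]

# Step 3: drop documents whose word sum is zero (LDA cannot fit empty rows)
row.totals <- apply(dtm, 1, sum)
dtm <- dtm[which(row.totals > 0), ]

# Step 4: fit the LDA model, then inspect the most common terms in each
# topic and the topic each document belongs to
lda.model <- LDA(dtm, k = 5, control = list(seed = 123))
terms(lda.model, 5)
topics(lda.model)

Step 5 then joins the per-document topic assignments (via the Title and week variables) with the weekly sentiment scores to report, for each week, its overall sentiment and one important topic.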

Author

Meijuan Zeng, MS student in Learning Analytics at Columbia University

