Text Mining in R

Sentiment Analysis Text Mining using R

This project explores text mining for classification: it builds a bag-of-words (#bagofwords) representation of the documents and applies machine learning models to it.

A few popular hashtags:

`#R` `#MachineLearning` `#NLP`

`#patternlearning` `#BagofWords` `#textanalytics`

Motivation

The amount of data available online grows daily, and with it the need to organize and regularize that data. Textual data is all around us, from web pages, e-books and media articles to emails and user comments. In many cases automatic text classification would shorten processing time, for example when detecting spam pages, sorting personal email, tagging products or filtering documents. Any organization that deals with a lot of unstructured text (e.g. in academia, marketing or government) could handle that data much more easily if it were standardized by categories or tags. The dataset used here is a collection of newsgroup documents. This collection of 4 newsgroups can be used for experiments in text applications of machine learning techniques, such as text classification and text clustering.

About the Project

What is Text Mining?

Text classification, or text categorization, is the activity of labelling natural language texts with relevant predefined categories. The idea is to organize text into different classes automatically, which can drastically simplify and speed up searching through documents or texts.

Steps involved in this project

3 major steps in the Text-Mining-in-R code:

1. While training and building a model, keep in mind that the first model is never the best one, so the best practice is the “trial and error” method. To make that process simpler, create a function for training and, in each attempt, save the results and accuracies.
2. I sorted the EDA process into two categories: general pre-processing steps that were common across all vectorizers and models, and certain pre-processing steps that I made optional so that model performance could be measured with and without them.
3. Accuracy was chosen as the measure for comparing models: the higher the accuracy, the better the model performs on the test data.
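The "function for training that saves results and accuracies" idea can be sketched roughly as follows. This is a minimal base R illustration, not the project's code: `run_experiment`, the two toy models, the feature columns `count_space` and `count_car`, and the toy data are all hypothetical; only the label column name `folder_class` is borrowed from the confusion matrix call later in this README.

```r
# Hypothetical helper (not from the project): fit one model, score it on the
# test set, and append a row with its name and accuracy to a results table.
run_experiment <- function(results, name, train_fn, predict_fn,
                           train, test, label_col) {
  model <- train_fn(train)
  preds <- predict_fn(model, test)
  accuracy <- mean(as.character(preds) == as.character(test[[label_col]]))
  rbind(results,
        data.frame(model = name, accuracy = accuracy, stringsAsFactors = FALSE))
}

# Made-up data standing in for bag-of-words counts.
train <- data.frame(
  count_space  = c(3, 2, 0, 0, 4, 0),
  count_car    = c(0, 0, 2, 3, 0, 1),
  folder_class = c("space", "space", "autos", "autos", "space", "autos"),
  stringsAsFactors = FALSE
)
test <- data.frame(
  count_space  = c(1, 0, 2, 0),
  count_car    = c(0, 2, 0, 1),
  folder_class = c("space", "autos", "space", "autos"),
  stringsAsFactors = FALSE
)

# Toy model 1: always predict the most frequent training class.
majority_train   <- function(train) names(which.max(table(train$folder_class)))
majority_predict <- function(model, test) rep(model, nrow(test))

# Toy model 2: a fixed keyword rule (nothing is learned from the data).
rule_train   <- function(train) NULL
rule_predict <- function(model, test) {
  ifelse(test$count_space > test$count_car, "space", "autos")
}

# Start with an empty log and add one row per attempt.
results <- data.frame(model = character(0), accuracy = numeric(0),
                      stringsAsFactors = FALSE)
results <- run_experiment(results, "majority_baseline",
                          majority_train, majority_predict,
                          train, test, "folder_class")
results <- run_experiment(results, "keyword_rule",
                          rule_train, rule_predict,
                          train, test, "folder_class")

# Best attempt first.
results <- results[order(-results$accuracy), ]
print(results)
```

Keeping every attempt in one table makes it easy to compare a new pre-processing option against earlier runs instead of relying on memory.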

Explanation

First, I created a bag-of-words file: clean_data.R contains all the methods to preprocess the text and generate the bag of words. The Corpus library handles both the preprocessing and the bag-of-words generation.
The following general pre-processing steps were carried out, since any document passed to a model must be in a consistent format:

Converting to lowercase
Removal of stop words
Removing alphanumeric characters
Removal of punctuation
Vectorization: TfVectorizer was used. Its accuracy was compared with that of models using TfIDFVectorizer; in all cases TfVectorizer gave better results, so it was chosen as the default vectorizer.
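The pre-processing and vectorization steps listed above can be sketched in base R as follows. This is an illustration under stated assumptions, not the contents of clean_data.R: it does not use the Corpus library, the documents and the stop word list are made up, `preprocess` is a hypothetical helper, and "removing alphanumeric characters" is read here as dropping tokens that contain digits. The term-frequency and TF-IDF matrices use the plain textbook definitions, which may differ from what TfVectorizer and TfIDFVectorizer compute.

```r
# Toy documents, made up for illustration (not from the project's dataset).
docs <- c(
  "The rocket launch was delayed, again!",
  "My car engine failed on the highway.",
  "NASA announced a new rocket engine."
)

# A tiny illustrative stop word list; real lists are much longer.
stop_words <- c("the", "was", "my", "on", "a")

# Hypothetical helper: turn one document into a vector of cleaned tokens.
preprocess <- function(text, stop_words) {
  text   <- tolower(text)                     # convert to lowercase
  text   <- gsub("[[:punct:]]", " ", text)    # remove punctuation
  tokens <- strsplit(text, "\\s+")[[1]]       # split on whitespace
  tokens <- tokens[nzchar(tokens)]            # drop empty strings
  tokens <- tokens[!grepl("[0-9]", tokens)]   # assumed reading: drop tokens with digits
  tokens[!tokens %in% stop_words]             # remove stop words
}

tokens <- lapply(docs, preprocess, stop_words = stop_words)
vocab  <- sort(unique(unlist(tokens)))

# Term-frequency bag of words: one row per document, one column per term.
tf <- t(vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(tf) <- vocab

# TF-IDF: scale each term's counts by log(N / number of documents containing it).
idf   <- log(nrow(tf) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, `*`)

print(tf)
print(round(tfidf, 3))
```

Terms that occur in most documents get a small IDF weight, so TF-IDF shrinks their columns relative to raw counts; whether that helps or hurts a classifier depends on the data, which is why the two vectorizers were compared.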

The following steps were added as optional pre-processing steps to see how model performance changed with and without them:

1. Stemming
2. Lemmatization
3. Using Unigrams/Bigrams
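As a rough illustration of the unigram/bigram option, the sketch below builds both kinds of features from an already tokenized document in base R. `make_ngrams` is a hypothetical helper and the tokens are made up; stemming and lemmatization are left out because they need external packages or dictionaries.

```r
# Hypothetical helper: build word n-grams from a vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) {
    return(character(0))  # too few tokens for even one n-gram
  }
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

tokens   <- c("rocket", "launch", "delayed", "again")  # made-up example
unigrams <- make_ngrams(tokens, 1)
bigrams  <- make_ngrams(tokens, 2)

# With the bigram option switched on, both sets become vocabulary entries.
features <- c(unigrams, bigrams)
print(features)
```

Bigrams keep some word order (for example "rocket launch") at the cost of a much larger and sparser vocabulary, so it is worth measuring accuracy with and without them.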

Confusion Matrix for Support Vector Machine using Bag of Words Generated using clean_data.r

```
> confusionMatrix(table(predsvm,data.test$folder_class))
Confusion Matrix and Statistics

predsvm  1  2  3  4
      1 31  0  0  0
      2  0 29  6  0
      3  0  3 28  0
      4  0  0  0 23

Overall Statistics

               Accuracy : 0.925
                 95% CI : (0.8624, 0.9651)
    No Information Rate : 0.2833
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.8994

 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: 1 Class: 2 Class: 3 Class: 4
```
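To make the overall statistics easier to read, the sketch below recomputes accuracy, the no-information rate and Cohen's kappa by hand from the four-by-four table printed above, using base R only. The variable names are illustrative; the counts are copied from the output, with rows as predictions and columns as true classes. The per-class recall at the end is an extra derived from the same counts, not a reproduction of the "Statistics by Class" table.

```r
# Confusion matrix copied from the output above
# (rows = predicted class, columns = actual class).
cm <- matrix(
  c(31,  0,  0,  0,
     0, 29,  6,  0,
     0,  3, 28,  0,
     0,  0,  0, 23),
  nrow = 4, byrow = TRUE,
  dimnames = list(predicted = as.character(1:4), actual = as.character(1:4))
)

n <- sum(cm)                        # 120 test documents

# Accuracy: share of documents on the diagonal.
accuracy <- sum(diag(cm)) / n       # 111 / 120 = 0.925

# No-information rate: accuracy of always guessing the largest actual class.
nir <- max(colSums(cm)) / n         # 34 / 120, about 0.2833

# Cohen's kappa: agreement beyond what the row/column totals give by chance.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)   # about 0.8994

# Per-class recall: correctly predicted documents / actual documents per class.
recall <- diag(cm) / colSums(cm)

print(round(c(accuracy = accuracy, nir = nir, kappa = cohen_kappa), 4))
print(round(recall, 4))
```

All the errors sit between classes 2 and 3 (6 and 3 documents), while classes 1 and 4 are classified perfectly.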

- The most interesting deduction is that the more specific a newsgroup's topic is, the more accurately the Naïve Bayes classifier can determine which newsgroup a document belongs to. The converse also holds: the less specific the newsgroup, the lower the accuracy.

- This shows in the accuracy figures: every newsgroup that isn't a misc group has an accuracy rate of at least 50%. The newsgroups at the bottom in terms of accuracy are all misc groups, including a 0.25% accuracy rate for talk.politics.misc.

- A reason for this is that posts written in misc newsgroups are rarely related to the actual root of the newsgroup. The misc section caters to topics of discussion other than the “root newsgroup”, so it is much easier for the classifier to confuse a document from a misc newsgroup with another newsgroup, and much harder for it to even consider the root newsgroup, since topics regarding the root newsgroup are posted there instead.

- For example, a post about guns posted in talk.religion.misc can easily be classified as talk.politics.guns, because it will use words similar to those found in talk.politics.guns posts. Likewise, posts about politics are less likely to appear in talk.politics.misc, because you are more likely to post in talk.politics.* (where the wildcard is the relevant section for the type of politics to be discussed), such as talk.politics.guns.

Libraries Used

Installation

Install randomForest from the R console: install.packages("randomForest")
Install caret from the R console: install.packages("caret")
Install mlr from the R console: install.packages("mlr")
Install MASS from the R console: install.packages("MASS")

How to run?

Project Reports

Download for the report.

Useful Links

Why Term Frequency is better than TF-IDF for text classification
Naïve Bayes Classification for 20 News Group Dataset
Analyzing word and document frequency: tf-idf
Natural Language Processing
K Nearest Neighbor in R
MLR Package

Related Work

Text Mining Analyzer - A Detailed Report on the Analysis

Contributing

Clone this repository:

git clone https://github.com/iamsivab/Text-Mining-in-R.git

Check out any issue from here.
Make changes and send Pull Request.

Need help?

📧 Feel free to contact me @ balasiva001@gmail.com
