Text Analysis

This repository demonstrates common Natural Language Processing (NLP) operations using Python-based APIs.

The following operations are performed:

1. Word Processing
2. Stemmed/Lemmatized Word Count
3. Part-of-Speech Count
4. Document Similarity

Prerequisites

  • Python 3.5+

  • MongoDB 3+

  • virtualenv -- pip install virtualenv

  • Flask -- pip install flask

  • PyMongo -- pip install pymongo

  • NLTK -- pip install -U nltk

  • Gensim -- pip install -U gensim

Installation

  • Clone the repository
  • Start a MongoDB server with mongod in Command Prompt
  • cd ConnectavoAssignment/flask/Scripts
  • Run activate flask in Command Prompt in the above-mentioned folder
  • cd ../../ (you are now in the main ConnectavoAssignment folder)
  • Run DatabaseConnection.py with python DatabaseConnection.py
  • Once the server starts running, open http://127.0.0.1:5000/ in your browser

Download

I have used the standard stop words provided by NLTK. To fetch them, run nltk.download('stopwords'); also run nltk.download('punkt').

For Ubuntu 16.0.4

  • Python 3.5+

  • MongoDB 3+

    To install MongoDB, follow this link

  • virtualenv:

    To install virtualenv, run sudo apt-get install python-virtualenv

  • Flask:

    To install Flask, follow this link

  • PyMongo -- pip install pymongo

  • NLTK -- pip install -U nltk

  • Gensim -- pip install -U gensim


Installation

  • Clone the repository
  • Start a MongoDB server with sudo service mongod start in a first terminal instance
  • Open another terminal instance
  • Change the directory with cd ConnectavoAssignment
  • In the ConnectavoAssignment directory, run virtualenv flask-env
  • Run source flask-env/bin/activate in this terminal instance
  • After activating the environment, run pip install Flask
  • Now run DatabaseConnection.py with python DatabaseConnection.py
  • Once the server starts running, open http://127.0.0.1:5000/ in your browser

Task 1

Word Processing

I have used the standard NLTK stop words in this task (fetched with nltk.download('stopwords')). This API returns the refined book with stop words removed.

Task 2

Stemmed / Lemmatized Word Count

I have used the Snowball stemmer in this task: it supports many languages and is more advanced than the Porter stemmer. Quoting Quora here: "Snowball is obviously more advanced in comparison with Porter and, when used, gives considerably more reliable results."
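A sketch of a stemmed word count using NLTK's Snowball stemmer (the sample word list is illustrative):

```python
from collections import Counter
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
words = ["running", "runs", "book", "books", "easily"]

# Map each word to its stem, then count stems rather than surface forms.
counts = Counter(stemmer.stem(w) for w in words)
print(counts)
```

Counting stems instead of raw words collapses inflected variants ("running", "runs") into a single entry.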

Task 3

Parts of Speech (PoS) Count

I have used NLTK to gather the parts of speech in the book. The nltk.pos_tag(tokens) function tags each token (word) with its respective part of speech.

Task 4

Document Similarity / Difference

I have used Gensim in this task because it is fast, processes data in a streaming fashion with memory independence, and provides efficient implementations. It is a very well optimized, but also highly specialized, library for jobs in the periphery of word2vec. That is: it offers an easy, surprisingly well-working and swift AI approach to unstructured raw text, based on a shallow neural network. If you are interested in production, or in getting deeper insights into neural networks, you might also have a look at TensorFlow, which offers a mathematically more generalized model, at the cost of some unpolished performance and scalability issues for now. Reference: Why Gensim and Why Fast Approach