<a href="https://colab.research.google.com/github/yswetha95/Text-Mininig-NLP/blob/master/Text_Mining_basics_using_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLTK

## 1. Sentence Tokenization / Sentence Segmentation

Sentence tokenization is the problem of dividing a string of written language into its component sentences.

In [0]:
import nltk  # the NLTK cells also need the data fetched with nltk.download('popular')

text="Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice"

sentences=nltk.sent_tokenize(text)

for w in sentences:
  print(w)


Backgammon is one of the oldest known board games.
Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East.
It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice
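
To see why sentence segmentation is a problem rather than a one-line string split, here is a minimal sketch that uses only the standard library. The sample string and the `naive_split` helper are assumptions made for illustration and are not part of the notebook. The point is only that a period does not always end a sentence.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

sample = "Dr. Smith plays backgammon. He rolls two dice."
parts = naive_split(sample)
print(parts)
# The abbreviation "Dr." is wrongly treated as a sentence boundary:
# ['Dr.', 'Smith plays backgammon.', 'He rolls two dice.']
```

A trained tokenizer such as the one behind `nltk.sent_tokenize` is designed to cope with many abbreviations of this kind.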


## 2. Word Tokenization / Word Segmentation

Word tokenization is the problem of dividing a string of written language into its component words.

In [0]:
for w in sentences:
    words = nltk.word_tokenize(w.strip('.'))
    print(words)
    print()

['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games']

['Its', 'history', 'can', 'be', 'traced', 'back', 'nearly', '5,000', 'years', 'to', 'archeological', 'discoveries', 'in', 'the', 'Middle', 'East']

['It', 'is', 'a', 'two', 'player', 'game', 'where', 'each', 'player', 'has', 'fifteen', 'checkers', 'which', 'move', 'between', 'twenty-four', 'points', 'according', 'to', 'the', 'roll', 'of', 'two', 'dice']
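
Word tokenization differs from splitting on whitespace mainly in how punctuation is handled. The following is a rough sketch with a regular expression. `simple_tokenize` is a hypothetical helper, not the algorithm NLTK uses, and the pattern covers only plain words, hyphenated words and numbers such as 5,000.

```python
import re

def simple_tokenize(sentence):
    """Very rough tokenizer (hypothetical): numbers like 5,000, words, or single punctuation marks."""
    return re.findall(r"\d+(?:,\d+)*|\w+(?:[-']\w+)*|[^\w\s]", sentence)

sentence = "Its history can be traced back nearly 5,000 years."
whitespace_tokens = sentence.split()
regex_tokens = simple_tokenize(sentence)

print(whitespace_tokens[-1])  # 'years.' keeps the full stop attached
print(regex_tokens[-2:])      # ['years', '.'] separates it
```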



## 3. Text Lemmatization and Stemming

Stemming and lemmatization both reduce inflected forms of a word, and sometimes derivationally related forms, to a common base form.

In [0]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

def compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word, pos):
    """
    Print the results of stemming and lemmatization using the passed stemmer, lemmatizer, word and pos (part of speech)
    """
    print("Stemmer:", stemmer.stem(word))
    print("Lemmatizer:", lemmatizer.lemmatize(word, pos))
    print()

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word = "seen", pos = wordnet.VERB)
compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word = "drove", pos = wordnet.VERB)


Stemmer: seen
Lemmatizer: see

Stemmer: drove
Lemmatizer: drive



In [0]:
stemmer.stem('Meeting')

'meet'

In [0]:
lemmatizer.lemmatize('Meeting',wordnet.NOUN)

'Meeting'

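### Stemming

Stemming chops off the ends of words using fixed rules. A stemmer has no access to context or to a vocabulary, so it cannot tell which part of speech a word is. Its main advantages are that it is easier to implement and usually runs faster than lemmatization. For example, a stemmer reduces "Meeting" to "meet", but relating "better" to "good" requires a lemmatizer with a dictionary.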

### Lemmatization

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
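
The contrast between the two approaches can be sketched without NLTK. Both functions below are toy stand-ins and are assumptions made for illustration: `toy_stem` is far simpler than the Porter algorithm, and `TOY_LEMMAS` is a three-entry lookup table standing in for a real vocabulary such as WordNet.

```python
def toy_stem(word):
    """Crude suffix stripping (hypothetical, far simpler than the Porter algorithm)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when a stem of at least three letters remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Tiny hand-written lookup table standing in for a real vocabulary
TOY_LEMMAS = {"drove": "drive", "seen": "see", "better": "good"}

def toy_lemmatize(word):
    """Dictionary lookup; unknown words are returned lower-cased but otherwise unchanged."""
    return TOY_LEMMAS.get(word.lower(), word.lower())

for w in ["Meeting", "drove", "better"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

Rule-based stripping handles "Meeting" but leaves "drove" and "better" untouched, while the lookup resolves irregular forms only if they are in the table.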

## 4. Stop Words

Some words, such as "the", "a" and "an", occur very frequently in text but carry little information for many machine learning models. Such stop words are therefore often removed before modeling.

In [0]:
from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [0]:
stop_words = set(stopwords.words("english"))
sentence = "Backgammon is one of the oldest known board games."

words = nltk.word_tokenize(sentence)
without_stop_words = [w for w in words if w not in stop_words]
print(without_stop_words)

['Backgammon', 'one', 'oldest', 'known', 'board', 'games', '.']
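
The output above still contains the "." token, and because the NLTK list is lower-case, a capitalized stop word such as "The" would also survive the filter. A minimal sketch of one way to handle both cases follows. To keep it runnable without the NLTK data, it assumes a small hand-picked subset of the stop word list and a pre-tokenized sample sentence.

```python
import string

# Assumption: a small hand-picked subset of the English stop word list, for illustration only
stop_words = {"the", "is", "of", "a", "an", "it", "its"}
punctuation = set(string.punctuation)

tokens = ["The", "game", "is", "one", "of", "the", "oldest", "."]

filtered = [
    t for t in tokens
    if t.lower() not in stop_words  # case-insensitive stop word check
    and t not in punctuation        # drop single-character punctuation tokens
]
print(filtered)  # ['game', 'one', 'oldest']
```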


# Bag of Words
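
A bag-of-words representation describes each document by how often each vocabulary word occurs in it, ignoring word order. The following is a minimal sketch with `collections.Counter`. The two sample documents and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter

docs = ["the dog chased the cat", "the cat slept"]

# Shared vocabulary, sorted so that vector positions are reproducible
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Return the count of each vocabulary word in doc, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(doc) for doc in docs]
print(vocabulary)  # ['cat', 'chased', 'dog', 'slept', 'the']
print(vectors)     # [[1, 1, 1, 0, 2], [1, 0, 0, 1, 1]]
```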

# TF-IDF
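
TF-IDF weights a term by how frequent it is within a document and how rare it is across documents. Several variants exist. The sketch below assumes the plain form tf(t, d) · log(N / df(t)), without the smoothing that libraries often apply, and uses hypothetical sample documents.

```python
import math
from collections import Counter

docs = [doc.split() for doc in ["the dog chased the cat", "the cat slept", "the dog barked"]]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    """Relative term frequency in doc times log(N / df); term must occur in some document."""
    tf = doc.count(term) / len(doc)
    idf = math.log(n_docs / df[term])
    return tf * idf

print(tf_idf("the", docs[0]))     # 0.0, since "the" occurs in every document
print(tf_idf("chased", docs[0]))  # positive, since "chased" occurs in only one document
```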

# Applied Text Mining in Python

## Findall

`re.findall` returns every substring of the given text that matches a regular expression; here it is used to pick out individual characters.

In [0]:
import re
text5='ouagadougou'

re.findall(r'[aeiou]',text5)

re.findall(r'[^aeiou]',text5)

['g', 'd', 'g']
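
A notebook cell displays only the value of its last expression, which is why only the consonants appear above. As a small sketch, printing both results makes the effect of the negated character class visible; the variable names here are illustrative.

```python
import re
from collections import Counter

word = 'ouagadougou'

vowels = re.findall(r'[aeiou]', word)       # any character inside the class
consonants = re.findall(r'[^aeiou]', word)  # ^ negates the class

print(vowels)           # ['o', 'u', 'a', 'a', 'o', 'u', 'o', 'u']
print(consonants)       # ['g', 'd', 'g']
print(Counter(vowels))  # 'o' and 'u' occur three times each, 'a' twice
```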

## Split

In [0]:
text6=text5.split('ou')
text6

['', 'agad', 'g', '']

In [0]:
'ou'.join(text6)

'ouagadougou'

In [0]:
list(text5)

['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u']

In [0]:
[c for c in text5]

['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u']

In [0]:
text8="  a quick brown fox       jumped over the lazy dog  "

In [0]:
text9=text8.split()
text9=' '.join(text9)

text9

'a quick brown fox jumped over the lazy dog'
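
The cell above relies on `split()` being called without an argument, which splits on runs of whitespace and discards empty strings. A short sketch of the difference from an explicit separator, with an equivalent regular-expression version for comparison:

```python
import re

text = "  a quick brown fox       jumped over the lazy dog  "

via_split = ' '.join(text.split())             # no argument: split on runs of whitespace
via_regex = re.sub(r'\s+', ' ', text).strip()  # collapse runs, then trim both ends

print(via_split == via_regex)  # True
print(text.split(' ')[:3])     # ['', '', 'a']: an explicit separator keeps empty strings
```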

In [0]:
text9.find('o')

10

In [0]:
text9.rfind('o')

40

In [0]:
text9.replace('o','O')


'a quick brOwn fOx jumped Over the lazy dOg'

In [0]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

len(text1) # The length of text1

76

In [0]:
text1=text1.rstrip()
text2 = text1.split(' ') # Return a list of the words in text1, separating by ' '.

len(text2)

13

In [0]:
text2

['Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations']

In [0]:
[w for w in text2 if w.istitle()]

['Ethics', 'United', 'Nations']

In [0]:
[w for w in text2 if w.endswith('s')]

['Ethics', 'ideals', 'objectives', 'Nations']

In [0]:
text3 = 'To be or not to be'
text4 = text3.split(' ')

len(text4)

6

In [0]:
len(set(text4))

5
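
The set has five elements rather than four because 'To' and 'to' are different strings. A minimal sketch of counting unique words regardless of case, by lower-casing each word before building the set:

```python
text3 = 'To be or not to be'
words = text3.split(' ')

unique_case_sensitive = set(words)               # 'To' and 'to' are kept apart
unique_lowercased = set(w.lower() for w in words)  # fold case before comparing

print(len(unique_case_sensitive))  # 5
print(len(unique_lowercased))      # 4
```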

In [0]:
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @swe NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')

text6

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@swe',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']

In [0]:
[w for w in text6 if w.startswith('#')]

['#UNSG']

In [0]:
import re # import re - a module that provides support for regular expressions

[w for w in text6 if re.search('@[A-Za-z0-9_]+', w)]

['@swe']

In [0]:
[w for w in text6 if re.search('@[A-Za-z0-9]+',w)]

['@swe']

In [0]:
text10 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr @UN @UN_Women'
text11 = text10.split(' ')

text11
print([w for w in text11 if re.search('@[A-Za-z0-9_]+',w)])

['@UN', '@UN_Women']


In [0]:
print([w for w in text11 if re.search(r'@\w+',w)])  # raw string so that \w reaches the regex engine unchanged

['@UN', '@UN_Women']
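
The list comprehensions above return whole tokens that contain a match, so trailing punctuation or an e-mail address would be returned as well. The sketch below shows one alternative: calling `re.findall` on the full string with a lookbehind. The sample `tweet` string is hypothetical.

```python
import re

tweet = 'Thanks @UN, @UN_Women and #UNSG! Contact: press@example.org'

# Token filter as in the cells above: keeps the whole token, punctuation included
tokens_with_at = [w for w in tweet.split(' ') if re.search(r'@\w+', w)]
print(tokens_with_at)  # ['@UN,', '@UN_Women', 'press@example.org']

# findall on the full string returns only the matched text;
# (?<!\w) requires that the symbol is not preceded by a word character,
# which skips the e-mail address
mentions = re.findall(r'(?<!\w)@\w+', tweet)
hashtags = re.findall(r'(?<!\w)#\w+', tweet)
print(mentions)  # ['@UN', '@UN_Women']
print(hashtags)  # ['#UNSG']
```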


# Working with Text Data in pandas

In [0]:
import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.", 
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]
 
df=pd.DataFrame(time_sentences,columns=['cname'])

In [0]:
df['cname'].str.len()

0    46
1    50
2    49
3    49
4    54
Name: cname, dtype: int64

In [0]:
df['cname'].str.split().str.len()

0     7
1     8
2     8
3    10
4    10
Name: cname, dtype: int64

In [0]:
df['cname'].str.contains('appointment')

0     True
1     True
2    False
3    False
4    False
Name: cname, dtype: bool

In [0]:
# find how many times a digit occurs in each string
df['cname'].str.count(r'\d')

0    3
1    4
2    3
3    4
4    8
Name: cname, dtype: int64

In [0]:
df['cname'].str.count(r'\w')

0    36
1    39
2    38
3    37
4    40
Name: cname, dtype: int64

In [0]:
# find each time as (full match, hour, minute); \d?\d allows one- or two-digit hours
df['cname'].str.findall(r'((\d?\d):(\d\d))')



0                       [(2:45, 2, 45)]
1                     [(11:30, 11, 30)]
2                       [(7:00, 7, 00)]
3                     [(11:15, 11, 15)]
4    [(08:10, 08, 10), (09:00, 09, 00)]
Name: cname, dtype: object
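
`str.findall` applies `re.findall` to every element of the column, so the pattern can be explored on a single sentence with the `re` module alone. The sketch below does that and then, as an extension that is not in the notebook, uses named groups to capture the am/pm suffix as well.

```python
import re

sentence = "Friday: Take the train at 08:10 am, arrive at 09:00am."

# An outer group for the whole time, inner groups for hour and minute
print(re.findall(r'((\d?\d):(\d\d))', sentence))
# [('08:10', '08', '10'), ('09:00', '09', '00')]

# Extension (assumption): named groups plus the am/pm suffix, with an optional space before it
pattern = re.compile(r'(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m)')
times = [m.groupdict() for m in pattern.finditer(sentence)]
print(times)
# [{'hour': '08', 'minute': '10', 'period': 'am'}, {'hour': '09', 'minute': '00', 'period': 'am'}]
```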

In [0]:
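# Run this cell before the NLTK cells above: it downloads the tokenizer
# models, stop word lists and WordNet data that they rely on
import nltk

nltk.download('popular')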

# References

https://towardsdatascience.com/introduction-to-natural-language-processing-for-text-df845750fb63

https://www.coursera.org/learn/python-text-mining/