# N-gram Language Model

Here's what you will do in this project:

 - Build a 3-gram (trigram) language model on news documents (the Reuters corpus)
 - Predict the next word in a sentence
 - Generate random news text
 - Find the probability of a sentence

## Loading required libraries and corpora

In [None]:
import nltk
from nltk.corpus import reuters

# loading corpus
nltk.download('reuters')
nltk.download('punkt')

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## About the Dataset

The Reuters corpus is a collection of 10,788 news documents totaling 1.3 million words.

In [None]:
reuters.categories()

['acq',
 'alum',
 'barley',
 'bop',
 'carcass',
 'castor-oil',
 'cocoa',
 'coconut',
 'coconut-oil',
 'coffee',
 'copper',
 'copra-cake',
 'corn',
 'cotton',
 'cotton-oil',
 'cpi',
 'cpu',
 'crude',
 'dfl',
 'dlr',
 'dmk',
 'earn',
 'fuel',
 'gas',
 'gnp',
 'gold',
 'grain',
 'groundnut',
 'groundnut-oil',
 'heat',
 'hog',
 'housing',
 'income',
 'instal-debt',
 'interest',
 'ipi',
 'iron-steel',
 'jet',
 'jobs',
 'l-cattle',
 'lead',
 'lei',
 'lin-oil',
 'livestock',
 'lumber',
 'meal-feed',
 'money-fx',
 'money-supply',
 'naphtha',
 'nat-gas',
 'nickel',
 'nkr',
 'nzdlr',
 'oat',
 'oilseed',
 'orange',
 'palladium',
 'palm-oil',
 'palmkernel',
 'pet-chem',
 'platinum',
 'potato',
 'propane',
 'rand',
 'rape-oil',
 'rapeseed',
 'reserves',
 'retail',
 'rice',
 'rubber',
 'rye',
 'ship',
 'silver',
 'sorghum',
 'soy-meal',
 'soy-oil',
 'soybean',
 'strategic-metal',
 'sugar',
 'sun-meal',
 'sun-oil',
 'sunseed',
 'tea',
 'tin',
 'trade',
 'veg-oil',
 'wheat',
 'wpi',
 'yen',
 'zinc']

Let's have a look at the first 10 sentences:

In [None]:
# print 10 sentences of the reuters corpus
for i, sent in enumerate(reuters.sents()[:10]):
  print("sent ", i, ":", " ".join(sent))

sent  0 : ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said .
sent  1 : They told Reuter correspondents in Asian capitals a U . S . Move against Japan might boost protectionist sentiment in the U . S . And lead to curbs on American imports of their products .
sent  2 : But some exporters said that while the conflict would hurt them in the long - run , in the short - term Tokyo ' s loss might be their gain .
sent  3 : The U . S . Has said it will impose 300 mln dlrs of tariffs on imports of Japanese electronics goods on April 17 , in retaliation for Japan ' s alleged failure to stick to a pact not to sell semiconductors on world markets at below cost .
sent  4 : Unofficial Japanese estimates put the impact of the tariffs at 10 billion dlrs and spokesmen for major electronics firms sai

## Model building



In [None]:
from nltk import bigrams, trigrams

# trigrams
[x for x in trigrams("the price of petrol has dropped".split())]

[('the', 'price', 'of'),
 ('price', 'of', 'petrol'),
 ('of', 'petrol', 'has'),
 ('petrol', 'has', 'dropped')]

In [None]:
from nltk.corpus import reuters
from collections import Counter, defaultdict

# Create a placeholder for model
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count how often each word w3 follows the word pair (w1, w2)
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1
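
The padding flags in the loop above matter for everything that follows, so here is a minimal sketch of what they do. The helper `padded_trigrams` is hypothetical (it is not part of NLTK); it assumes that NLTK pads a trigram sequence with two `None` symbols on each side, which is what the later check for `[None, None]` in the generation code relies on.

```python
def padded_trigrams(tokens, pad=None):
    """Return all trigrams of `tokens` after adding two pad symbols per side.

    Hypothetical helper for illustration only; it is meant to mimic
    trigrams(tokens, pad_left=True, pad_right=True) with the default pad symbol.
    """
    padded = [pad, pad] + list(tokens) + [pad, pad]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


for gram in padded_trigrams(["prices", "fell"]):
    print(gram)
```

With padding, the key `(None, None)` collects the words that start a sentence, and a word followed by `None`, `None` marks the end of one.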

**Let's see this in action. Here is what our model looks like:**

In [None]:
model

In [None]:
# count the words that follow ("world", "markets")
dict(model["world", "markets"])

{',': 13,
 '.': 10,
 '."': 1,
 'and': 16,
 'at': 5,
 'continues': 1,
 'has': 1,
 'helped': 1,
 'in': 1,
 'while': 1}

In [None]:
# find the relative frequency of each word in the corpus (unigram probabilities)
counts = Counter(reuters.words())
total_count = len(reuters.words())

for word in counts:
    counts[word] /= total_count

# transform the trigram counts into conditional probabilities P(w3 | w1, w2)
for w1_w2 in model:
    pair_total = sum(model[w1_w2].values())
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= pair_total

In [None]:
# predict the next word
dict(model["world", "markets"])

{',': 0.26,
 '.': 0.2,
 '."': 0.02,
 'and': 0.32,
 'at': 0.1,
 'continues': 0.02,
 'has': 0.02,
 'helped': 0.02,
 'in': 0.02,
 'while': 0.02}
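
The normalization step above can be read as a maximum-likelihood estimate. Written out, with $C(\cdot)$ standing for the raw trigram counts stored in `model` (the notation is introduced here for illustration only):

$$
P(w_3 \mid w_1, w_2) = \frac{C(w_1, w_2, w_3)}{\sum_{w} C(w_1, w_2, w)}
$$

For the pair ("world", "markets") the counts shown earlier sum to 50, so for example

$$
P(\text{and} \mid \text{world}, \text{markets}) = \frac{16}{50} = 0.32,
$$

which matches the value in the output above.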

## Inference - Text generation

In [None]:
# predict the next word
sorted(dict(model["today", "the"]).items(), key=lambda x: x[1], reverse=True)

[('company', 0.16666666666666666),
 ('price', 0.1111111111111111),
 ('public', 0.05555555555555555),
 ('European', 0.05555555555555555),
 ('Bank', 0.05555555555555555),
 ('emirate', 0.05555555555555555),
 ('overseas', 0.05555555555555555),
 ('newspaper', 0.05555555555555555),
 ('Turkish', 0.05555555555555555),
 ('increase', 0.05555555555555555),
 ('options', 0.05555555555555555),
 ('Higher', 0.05555555555555555),
 ('pound', 0.05555555555555555),
 ('Italian', 0.05555555555555555),
 ('time', 0.05555555555555555)]

In [None]:
sorted(dict(model["the", "price"]).items(), key=lambda x: x[1], reverse=True)

[('of', 0.3209302325581395),
 ('it', 0.05581395348837209),
 ('to', 0.05581395348837209),
 ('for', 0.05116279069767442),
 ('.', 0.023255813953488372),
 ('at', 0.023255813953488372),
 ('adjustment', 0.023255813953488372),
 ('is', 0.018604651162790697),
 (',', 0.018604651162790697),
 ('paid', 0.013953488372093023),
 ('increases', 0.013953488372093023),
 ('per', 0.013953488372093023),
 ('the', 0.013953488372093023),
 ('will', 0.013953488372093023),
 ('cut', 0.009302325581395349),
 ('cuts', 0.009302325581395349),
 ('(', 0.009302325581395349),
 ('differentials', 0.009302325581395349),
 ('has', 0.009302325581395349),
 ('stayed', 0.009302325581395349),
 ('was', 0.009302325581395349),
 ('freeze', 0.009302325581395349),
 ('increase', 0.009302325581395349),
 ('would', 0.009302325581395349),
 ('yesterday', 0.004651162790697674),
 ('effect', 0.004651162790697674),
 ('used', 0.004651162790697674),
 ('climate', 0.004651162790697674),
 ('reductions', 0.004651162790697674),
 ('limit', 0.004651162790697

In [None]:
import random

def gen_text():

  # starting words
  text = ["today", "the"]
  sentence_finished = False
  
  while not sentence_finished:
    # select a random probability threshold  
    r = random.random()
    accumulator = .0

    for word in model[tuple(text[-2:])].keys():
        accumulator += model[tuple(text[-2:])][word]

        # pick the first word whose cumulative probability reaches r
        if accumulator >= r:
            text.append(word)
            break

    if text[-2:] == [None, None]:
        sentence_finished = True
  
  print(' '.join(t for t in text if t))

for i in range(5):
  gen_text()

today the public ," said Raymond Stone , chief economist of Schroeder , Muenchmeyer , Hengst ' s land assets are growing .
today the increase in subsidies .
today the increase reflects high costs of restructuring , or 18 . 64 DLRS
today the time of the split would be resolved relatively soon .
today the overseas markets ," said Judy Weissman , FCOJ , mostly in West Berlin , echoing comments from bankers and economists said .
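
The inner loop of `gen_text` samples the next word in proportion to its probability by walking the cumulative distribution. The sketch below isolates that step on a made-up distribution so it can be checked without the corpus; `sample_next` and `toy_dist` are hypothetical names, and the probabilities are invented for illustration.

```python
import random


def sample_next(dist, rng=random):
    """Draw one key from a {word: probability} dict by cumulative sampling."""
    r = rng.random()          # uniform threshold in [0, 1)
    accumulator = 0.0
    word = None
    for word, p in dist.items():
        accumulator += p
        if accumulator >= r:  # first word whose cumulative mass reaches r
            return word
    return word               # fallback if rounding leaves the total just below r


# Made-up distribution, not taken from the Reuters model
toy_dist = {"company": 0.5, "price": 0.3, "public": 0.2}

rng = random.Random(0)        # seeded so the run is repeatable
draws = [sample_next(toy_dist, rng) for _ in range(10_000)]
print({w: draws.count(w) / len(draws) for w in toy_dist})
```

Over many draws the empirical frequencies should approach the probabilities in `toy_dist`. The standard library call `random.choices(list(dist), weights=list(dist.values()))` performs the same kind of weighted draw.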


## Inference - Sentence probability

In [None]:
# find the conditional probability of each trigram in a sentence

def sent_prob(sent):
  probs = []
  for w1, w2, w3 in trigrams(sent.split()):
      # unseen trigrams get 0, the default value of the model
      probs.append(model[w1, w2][w3])
  return probs

In [None]:
model["the", "price"]["of"]

0.3209302325581395

In [None]:
sent_prob("the price of oil has dropped")

[0.3209302325581395, 0.04332129963898917, 0, 0]

In [None]:
sent_prob("the price of all has dropped")

[0.3209302325581395, 0.0036101083032490976, 0, 0]

In [None]:
sent_prob("oil and natural gas")


[0.10223642172523961, 0.9772727272727273]

In [None]:
sent_prob("owl and natural gas")

[0, 0.9772727272727273]

In [None]:
sent_prob("large price of stock")


[0, 0.0036101083032490976]

In [None]:
sent_prob("high price of stock")

[0.2, 0.0036101083032490976]
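
`sent_prob` returns one conditional probability per trigram rather than a single number. One way to combine them, sketched here under the assumption that the sentence score is the product of its trigram probabilities (that is, the probability of the sentence given its first two words, since `sent_prob` does not pad), is shown below. The function names are hypothetical; the input values are copied from the outputs above.

```python
import math


def sentence_probability(trigram_probs):
    """Multiply per-trigram probabilities into a single sentence score."""
    return math.prod(trigram_probs)


def sentence_log_probability(trigram_probs):
    """Sum of log probabilities; -inf if any trigram was never seen."""
    if any(p == 0 for p in trigram_probs):
        return float("-inf")
    return sum(math.log(p) for p in trigram_probs)


# "oil and natural gas": both trigrams were seen in the corpus
print(sentence_probability([0.10223642172523961, 0.9772727272727273]))

# "owl and natural gas": one unseen trigram drives the product to zero
print(sentence_probability([0, 0.9772727272727273]))
```

A single unseen trigram makes the whole product zero, which is why several of the sentences above would score 0 under this model. Summing logs instead of multiplying avoids numerical underflow on long sentences.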