# Text Generation or What's the Next Word?

### Hey Kagglers, what's up? One more concept for my dictionary.
> This notebook is all about text generation: given a few starting words, the model predicts the next word. Interesting, isn't it?

Let's start. This is a probabilistic model that predicts the next word of your text. As you may have guessed, we can use it for text generation. Amazing!

The main concept: if you already have some starting words, the model estimates the probability of each possible next word from our text corpus. For example, if your starting words are 'Welcome to', what will the next word be? We choose the word from the corpus that has the highest probability given that its previous two words are 'Welcome' and 'to'. Getting the concept of probability?
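
To make the idea concrete, here is a minimal sketch on a made-up toy corpus (the sentence and all variable names are illustrative assumptions, not the chat data used below). It counts which words follow the pair ('welcome', 'to') and turns those counts into probabilities.

```python
from collections import Counter

# Toy corpus, invented purely for illustration.
toy_corpus = "welcome to kaggle welcome to the forum welcome to kaggle".split()

# Count every word that directly follows the pair ('welcome', 'to').
followers = Counter(
    w3
    for w1, w2, w3 in zip(toy_corpus, toy_corpus[1:], toy_corpus[2:])
    if (w1, w2) == ("welcome", "to")
)

# Convert counts to conditional probabilities P(next | 'welcome', 'to').
total = sum(followers.values())
probabilities = {word: count / total for word, count in followers.items()}

# The prediction is the follower with the highest probability.
best_word = max(probabilities, key=probabilities.get)
print(probabilities)
print(best_word)
```

In this toy corpus 'kaggle' follows the pair twice and 'the' once, so 'kaggle' wins with probability 2/3.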

### -------------------------- **Index** ----------------------------------
1. Data Loading.
2. Data Cleaning.
3. Model Development.
4. Testing.
5. What's Next?

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import gc

import matplotlib.pyplot as plt
import seaborn as sns

from nltk import ngrams
import re
from collections import defaultdict

### 1. Data Loading
I picked this dataset more or less at random, so if you want to work with a dataset of your choice, please go ahead and test it. Remember to fork my kernel. Hahaha.

This data is a chat log between different users, with one message per row. The dataset has many columns, but for my purpose only two matter: the user id and the text message. So let's load the data, what are you waiting for?

In [None]:
df = pd.read_csv('../input/all-posts-public-main-chatroom/freecodecamp_casual_chatroom.csv',
                 usecols=['fromUser.id', 'text'])
print(df.head())
gc.collect()

In [None]:
# Renaming columns for my better understanding
df.rename(columns={'fromUser.id': 'id'}, inplace = True)

In [None]:
df.text = df.text.astype(str)
df.dtypes

Let's check the top 10 users who have sent the most messages, i.e. the 10 most active users.

I will use the chat of the most active user.

In [None]:
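# Count messages per user and keep the 10 most active users.
# Naming the columns explicitly keeps this working across pandas versions,
# since the default column names of value_counts().reset_index() changed in pandas 2.0.
id_count = (df.id.value_counts().head(10)
            .rename_axis('user_id').reset_index(name='message_count'))
plt.figure(figsize=(12, 8))
g = sns.barplot(x='user_id', y='message_count', data=id_count, color='green')
plt.xticks(rotation=45)
plt.title('Top Active users')
plt.xlabel("User ID")
plt.ylabel("Count of Message")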

In [None]:
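# Keep only the most active user's messages; .copy() avoids SettingWithCopyWarning on later assignments
df = df[df.id=='55b977f00fc9f982beab7883'].copy()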
df.head()

### 2. Data Cleaning.
Data cleaning is a crucial part of machine learning, and in NLP it is a must. So let's dig into the data.

Below are the data cleaning steps. These are not hard and fast rules; you can clean the data according to your requirements. I am using Python's **re** package. You can try **spacy** too. There are many interesting NLP packages available, so just explore and have fun.
* Lowercase the text
* Remove URLs
* Remove @ tags
* Remove # tags
* Remove everything except letters.
* Remove single letters
* Collapse multiple spaces between words
* Strip leading and trailing spaces.

That's it. Keep in mind that the steps are applied in this order.

In [None]:
def clean_text(text):
    text = text.lower()                          # lowercase
    text = re.sub(r'http\S+', ' ', text)         # remove URLs
    text = re.sub(r'@\S+', ' ', text)            # remove @ tags
    text = re.sub(r'#\S+', ' ', text)            # remove # tags
    text = re.sub(r'[^a-z]', ' ', text)          # keep only the letters a-z
    # drop single-letter words (text.split() also collapses repeated spaces)
    text = ' '.join(word for word in text.split() if len(word) > 1)
    text = text.strip()
    return text

In [None]:
df.text = df.text.map(clean_text)
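
To see what each cleaning step does, here is a small trace on a made-up message (the sample string and the `steps` list are assumptions for illustration). It applies the same regular expressions one at a time and prints the intermediate result.

```python
import re

# Made-up message, used only to illustrate the cleaning steps.
sample = "Hey @bob check https://example.com #python it's a GREAT day!!  "

# Each entry is (description, function); the order matches the list above.
steps = [
    ("lowercase", lambda t: t.lower()),
    ("remove URLs", lambda t: re.sub(r'http\S+', ' ', t)),
    ("remove @ tags", lambda t: re.sub(r'@\S+', ' ', t)),
    ("remove # tags", lambda t: re.sub(r'#\S+', ' ', t)),
    ("keep letters only", lambda t: re.sub(r'[^a-z]', ' ', t)),
    # split() + join drops single letters and collapses repeated spaces in one go
    ("drop single letters", lambda t: ' '.join(w for w in t.split() if len(w) > 1)),
    ("strip outer spaces", lambda t: t.strip()),
]

cleaned = sample
for name, step in steps:
    cleaned = step(cleaned)
    print(f"{name:>20}: {cleaned!r}")
```

Note how "it's" becomes "it s" once punctuation is removed, and the leftover "s" is then dropped as a single letter. That is one reason the order of the steps matters.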

In [None]:
# split() without arguments ignores empty strings, so an empty message counts as 0 words
df['word_count'] = df.text.apply(lambda text: len(text.split()))

In [None]:
# Creating text corpus
text = df.text.str.cat(sep=' ')

Let's read the first 300 characters of the text. Does it look clean and clear? Amazing!

In [None]:
print(len(text))
print(text[:300])

### 3. Model Development.
Basically, we generate trigrams: the first two words of each trigram form the key (the input), and the third word is a value for that key.

If you don't know the concept of n-grams (bigrams, trigrams, and so on), this paragraph is for you; others can skip it. If your text is "I am a python developer and working in some abc company", then the bigrams are [("I", "am"), ("am", "a"), ("a", "python"), ...]. Similarly, you can get trigrams or any n-gram.

I am using the nltk package to build the trigrams. You can also write your own algorithm.

In [None]:
trigram = ngrams(text.split(), n=3)
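
If you would rather not depend on nltk, n-grams are easy to build by hand. The sketch below is one possible way to do it; `make_ngrams` is a hypothetical helper name, not part of any library.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as a list of tuples."""
    # zip over n shifted views of the token list; zip stops at the shortest one.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "I am a python developer".split()
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

Unlike the nltk generator, this returns a list, so it can be iterated more than once at the cost of holding everything in memory.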

Here we use `defaultdict` from the `collections` package. The speciality of a default dictionary is that it does not throw an error when you access an unknown key; it creates the key and returns a default value instead (here, an empty inner dictionary or a count of 0).

In [None]:
model = defaultdict(lambda: defaultdict(lambda: 0))

for w1, w2, w3 in trigram:
    model[(w1, w2)][w3] += 1
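
A quick sketch of the `defaultdict` behaviour described above, using a tiny hand-made table (the word pairs are made up for illustration). It also shows one side effect worth knowing: plain indexing inserts the missing key, while `.get()` does not.

```python
from collections import defaultdict

# Same nested structure as the model: (w1, w2) -> {w3: count}.
counts = defaultdict(lambda: defaultdict(int))
counts[("welcome", "to")]["kaggle"] += 1

# Reading an unseen inner key returns the default value 0 instead of raising KeyError.
unseen_count = counts[("welcome", "to")]["mars"]

# Side effect: plain indexing inserted the missing key, so the inner dict grew.
keys_after_lookup = sorted(counts[("welcome", "to")])

# .get() reads without inserting anything.
missing = counts.get(("hello", "world"))
print(unseen_count, keys_after_lookup, missing)
```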

Let's calculate the probability of the third word of each trigram, given that we already know its previous two words.

In [None]:
for key in model:
    total_count_sum = sum(model[key].values())
    total_count_sum = float(total_count_sum)
    for index in model[key]:
        model[key][index] = model[key][index]/total_count_sum
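
A simple sanity check after this step is that the probabilities for every word pair add up to 1. The sketch below runs the same normalisation on a tiny hand-made count table (`toy_model` and its contents are assumptions for illustration) and verifies that property.

```python
import math
from collections import defaultdict

# Tiny hand-made count table standing in for the real model (illustrative only).
toy_model = defaultdict(lambda: defaultdict(int))
for w1, w2, w3 in [("a", "b", "c"), ("a", "b", "d"), ("a", "b", "c"), ("b", "c", "a")]:
    toy_model[(w1, w2)][w3] += 1

# Same normalisation as above: divide each count by the total for its context.
for context, followers in toy_model.items():
    total = sum(followers.values())
    for word in followers:
        followers[word] /= total

# Every context's probabilities should add up to 1 (up to floating-point error).
all_normalised = all(
    math.isclose(sum(followers.values()), 1.0) for followers in toy_model.values()
)
print(dict(toy_model[("a", "b")]), all_normalised)
```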

Here `start` holds the starting two words of the text, and the model then predicts the next word. For text generation, we randomly choose the next word among the words that followed the current pair in the corpus. If you only want a single next word, you can choose the one with the highest probability.

In [None]:
start = ['sends', 'brownie']
max_words = 100
generated_text = list(start)  # copy, so the seed list itself is not modified

while len(generated_text) <= max_words:
    # Look up the words seen after the last two generated words.
    context = tuple(generated_text[-2:])
    candidates = list(model.get(context, {}))
    if not candidates:
        # Dead end: this word pair never appeared in the corpus.
        break
    # Pick uniformly at random among the observed next words.
    word = candidates[np.random.randint(len(candidates))]
    generated_text.append(word)
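
The loop above picks uniformly among the words that followed the current pair, so the probabilities we computed are not used during generation. One possible variation, sketched here with a hypothetical helper `sample_next_word` and a made-up probability table, is to draw the next word in proportion to its probability using the standard library's `random.choices`.

```python
import random

def sample_next_word(followers, rng=random):
    """Pick a next word with probability proportional to its weight."""
    words = list(followers)
    weights = [followers[w] for w in words]
    # random.choices draws with replacement according to the given weights.
    return rng.choices(words, weights=weights, k=1)[0]

# Hypothetical probability table for one context, for illustration only.
toy_followers = {"points": 0.7, "to": 0.2, "and": 0.1}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
draws = [sample_next_word(toy_followers, rng) for _ in range(1000)]
print({w: draws.count(w) for w in toy_followers})
```

With weighted sampling, frequent continuations show up more often, which tends to make the generated text look closer to the original chat, while rare continuations still get a chance.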

### 4. Testing
Let's check the output and interpret it. The model has generated pretty cool output, hasn't it? You can feed more input data to this model, which may give better results.

In [None]:
final_text = ' '.join(generated_text)

print(final_text)

In [None]:
gc.collect()

#### Tip - Most probable next word.

In [None]:
start = ['welcome', 'to']

max(model[start[0], start[1]], key=model[start[0], start[1]].get)
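
Note that `max` raises a `ValueError` when the word pair was never seen, because there is nothing to choose from. A slightly more defensive sketch, using a hypothetical helper `most_probable_next` and a made-up probability table, returns the top few candidates and an empty list for unseen pairs.

```python
def most_probable_next(model, w1, w2, k=3):
    """Return up to k (word, probability) pairs for the context (w1, w2)."""
    followers = model.get((w1, w2), {})  # .get avoids errors for unseen pairs
    ranked = sorted(followers.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hypothetical probabilities, standing in for the trained model.
toy_model = {("welcome", "to"): {"the": 0.5, "kaggle": 0.3, "my": 0.2}}

top_two = most_probable_next(toy_model, "welcome", "to", k=2)
unseen = most_probable_next(toy_model, "hello", "world")
print(top_two, unseen)
```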

### 5. What's Next?
This is a very basic language model. You can use modern neural models such as RNN, LSTM, GRU, etc. to generate text. These models take grammar, context, and more into account.

A disadvantage of this model is that it is not very flexible: it can only continue word pairs it has seen in the corpus, and it works with a medium quantity of data.