# Task

## Requirements
1. Do text preprocessing (e.g., stopword removal, lemmatization, stemming, etc.)
2. TF-IDF text representation
3. Run LDA
4. Identify the optimal number of topics
5. Show top 10 words for each topic.


## Data
The dataset describes 2885 datasets in 15 columns:

- Title
- Subtitle
- Owner
- Vote
- Last update
- Tags
- Datatype
- Size
- License
- Views
- Downloads
- Kernels
- Topics
- URL
- Description


# Approach

The columns Title, Subtitle and Description contain free text and can therefore be used for topic modeling. We start with Title, continue with Subtitle and finish with Description to get the best possible topics out of the dataset.

# Step-by-Step

## Preparation
### Load Packages & Data

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
# Load the pandas library with alias 'pd'
import pandas as pd
pd.options.display.max_rows = 10
d = pd.read_csv("dataset.csv")
d2 = d.copy()  # independent backup for testing (plain assignment would only alias d)
d

In [None]:
# d.shape
d.info()

# "Title" Analysis

## Text Preprocessing

In [None]:
# Load the regular expression library
import re

# Remove punctuation (basic cleaning, not lemmatization)
d['title_preprocessed'] = d['Title'].map(lambda x: re.sub(r'[,.!?]', '', x))

# Convert the titles to lowercase
d['title_preprocessed'] = d['title_preprocessed'].map(lambda x: x.lower())

# Print out the first rows
d['title_preprocessed'].head() # success

In [None]:
# check interim result 

# Import the wordcloud library
from wordcloud import WordCloud

# Join the different processed titles together.
long_string = ','.join(list(d['title_preprocessed'].values))

# Create a WordCloud object
wordcloud = WordCloud(background_color="white", max_words=5000,
                      contour_width=3, contour_color='steelblue',
                      height=400, width = 800)

# Generate a word cloud
wordcloud.generate(long_string)

# Visualize the word cloud
wordcloud.to_image()

In [None]:
# remove stop words
from nltk.corpus import stopwords
# nltk.download('stopwords')  # run once if the stop word list is missing
stop = stopwords.words('english')
# stop.append("it")
print(stop)
# title_no_stop_words.values.apply(lambda x: [item for item in x if item not in stop]) # first unsuccessful approach
# d['title_preprocessed'] = pd.Series([word for word in d['title_preprocessed'] if word not in stop]) # second unsuccessful approach
# split each title into a list of tokens, then drop the stop words
d['title_preprocessed'] = d['title_preprocessed'].str.lower().str.split()
d['title_preprocessed'] = d['title_preprocessed'].apply(lambda x: [item for item in x if item not in stop])

In [None]:
# check result
print(list(d['Title'][15:16]))
print(list(d['title_preprocessed'][15:16])) 
# success

In [None]:
# perform stemming
from nltk.stem import PorterStemmer
stemming = PorterStemmer()

# Stem every token separately, then join the stems into one string per title.
# Calling stem() on the joined string would treat the whole title as a single word.
d['title_preprocessed'] = d['title_preprocessed'].apply(
    lambda words: " ".join(stemming.stem(word) for word in words))

In [None]:
# check result
print(d['Title'][:10])
print("\n"*2)
print(d['title_preprocessed'][:10]) 
# success

## TF-IDF Transformation

In [None]:
d_backup = d.copy()  # copy, so later changes to d do not affect the backup

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_features=1000)

The `max_features=1000` argument limits the vocabulary to the 1000 terms with the highest term frequency across the corpus; it does not rank terms by their TF-IDF score.
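scikit-learn documents `max_features` as keeping the terms with the highest term frequency across the corpus. The sketch below mimics that rule in plain Python on a small made-up corpus. The helper `limit_vocabulary` and the toy titles are hypothetical, and this is not scikit-learn's actual implementation.

```python
from collections import Counter

def limit_vocabulary(docs, max_features):
    """Keep the max_features terms with the highest total count in docs."""
    counts = Counter(token for doc in docs for token in doc.split())
    # most_common() sorts by descending count
    return [term for term, _ in counts.most_common(max_features)]

# Hypothetical, already preprocessed titles
toy_titles = [
    "crime india data",
    "crime data",
    "wedding data",
]
vocabulary = limit_vocabulary(toy_titles, max_features=2)
print(vocabulary)
```

With 2885 short titles, a limit of 1000 terms probably drops only rare words, but that depends on the actual vocabulary size.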

In [None]:
# tfidf transformation
tfidf = tfidf_vectorizer.fit_transform(d['title_preprocessed'])
tfidf.data[:50]
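To make the numbers in `tfidf.data` easier to read, here is a minimal sketch of how one TF-IDF row can be computed by hand. It assumes the default `TfidfVectorizer` settings as documented (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalization per document). The function `tfidf_row` and the toy documents are hypothetical.

```python
import math

def tfidf_row(doc_tokens, all_docs_tokens):
    """TF-IDF weights for one document with smoothed idf and L2 normalization."""
    n_docs = len(all_docs_tokens)
    weights = {}
    for term in set(doc_tokens):
        tf = doc_tokens.count(term)                              # raw term count
        df = sum(term in tokens for tokens in all_docs_tokens)   # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1              # smoothed idf
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))       # L2 norm of the row
    return {term: w / norm for term, w in weights.items()}

toy_docs = [doc.split() for doc in ["crime india data", "crime data", "wedding data"]]
row = tfidf_row(toy_docs[0], toy_docs)
for term, weight in sorted(row.items()):
    print(term, round(weight, 3))
```

Terms that occur in fewer documents receive a higher weight, which is why a word such as "data" contributes little even though it is frequent.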

In [None]:
lda = LatentDirichletAllocation(n_components=3, max_iter=5, # n_components = number of topics to be found
                                learning_method='online', 
                                learning_offset=50., 
                                random_state=22) # set seed
                                # doc_topic_prior (alpha) and topic_word_prior (beta)
                                # both default to 1 / n_components

lda.fit(tfidf)

lda.components_

Each row of `lda.components_` belongs to one topic, and each value in a row belongs to one word in the feature list below (e.g. the value 8.70... belongs to '10').

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
print(tfidf_feature_names[:100])
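The values in `lda.components_` are unnormalized topic-word weights. According to the scikit-learn documentation, dividing each row by its sum turns it into a distribution over words. The following sketch shows this with made-up numbers in plain Python; `toy_components`, `toy_feature_names` and `topic_word_distribution` are hypothetical stand-ins, not part of the notebook.

```python
# Hypothetical stand-ins for lda.components_ and the feature names
toy_components = [
    [8.0, 1.0, 1.0],   # topic 0
    [0.5, 0.5, 4.0],   # topic 1
]
toy_feature_names = ["crime", "data", "wedding"]

def topic_word_distribution(components):
    """Normalize each topic row so that its weights sum to 1."""
    return [[value / sum(row) for value in row] for row in components]

distribution = topic_word_distribution(toy_components)
for topic_idx, row in enumerate(distribution):
    # Column j of a topic row belongs to toy_feature_names[j]
    print(topic_idx, dict(zip(toy_feature_names, row)))
```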

## Print out top 20 words
We print 20 instead of 10 words to make it easier to name the topics.

In [None]:
for topic_idx, topic in enumerate(lda.components_):
    # argsort() sorts ascending; [:-20-1:-1] takes the last 20 indices in reverse order, i.e. the 20 largest weights
    top_words = [tfidf_feature_names[i] for i in topic.argsort()[:-20-1:-1]]
    print('Topic:', topic_idx, '--', top_words)
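The slice `[:-k-1:-1]` can look cryptic, so here is a small sketch of what it does. It uses a plain-Python stand-in for `argsort` and made-up weights; `toy_topic`, `toy_names` and the local `argsort` helper are hypothetical.

```python
def argsort(values):
    """Indices that sort values in ascending order (like numpy's argsort)."""
    return sorted(range(len(values)), key=lambda i: values[i])

toy_topic = [0.2, 3.1, 0.7, 5.4, 1.9]   # hypothetical topic weights
toy_names = ["a", "b", "c", "d", "e"]   # hypothetical feature names
k = 3

ascending = argsort(toy_topic)           # index of the smallest weight first
top_k = ascending[:-k-1:-1]              # walk backwards from the end, k steps
print([toy_names[i] for i in top_k])
```

The slice starts at the last element and stops before position `-k-1`, so it returns exactly the `k` largest weights in descending order.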

Topic guessing:

- Topic 0: Crime in India, e. g. https://www.kaggle.com/rajanand/crime-in-india
- Topic 1: Predict Crypto Price Trends, e. g. https://cointelegraph.com/explained/how-to-predict-crypto-price-trends-explained
- Topic 2: Wedding Announcements, e.g. https://www.nytimes.com/2019/05/26/fashion/weddings/this-weeks-wedding-announcements.html

# "Description" Analysis

## Text Preprocessing

In [None]:
# Load the regular expression library
import re

# Remove punctuation (basic cleaning, not lemmatization)
d['description_preprocessed'] = d['Description'].map(lambda x: re.sub(r'[,.!?]', '', x))

# Convert the descriptions to lowercase
d['description_preprocessed'] = d['description_preprocessed'].map(lambda x: x.lower())

# Print out the first rows
d['description_preprocessed'].head() # success

The approach used for 'Title' does not work here. 'Description' needs its own preprocessing.

In [None]:
# check single item to get problematic patterns
list(d['Description'][2:3])

Problematic patterns are:
- `\r\n`: the Windows end-of-line sequence
- Hyperlinks, which could be problematic as well

We ignore both patterns for the moment.


Check for missing values


## Text Preprocessing

In [None]:
# create new variable and lower all text
d['description_preprocessed'] = d['Description'].str.lower()

In [None]:
type(d['description_preprocessed'][1])
d['description_preprocessed'][1][:200]

In [None]:
d['description_preprocessed'].isnull().values.any()

In [None]:
# get number of rows with NAs in 'Description'
d['description_preprocessed'].isnull().sum()

In [None]:
# show rows with NAs in 'Description'
nan_rows = d[d['description_preprocessed'].isnull()]
nan_rows

In [None]:
# remove all rows with NAs in variable 'Description'
d = d.dropna(subset = ['description_preprocessed'])

In [None]:
# check result
print(d['Description'][:10])
print("\n"*2)
print(d['description_preprocessed'][:10]) 
# success

In [None]:
data = list(d['description_preprocessed'])
# data = data[:500]

In [None]:
data[:1]

In [None]:
len(data)

In [None]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
# nltk.download('punkt')  # run once if the tokenizer data is missing

porter = PorterStemmer()

def stemSentence(sentence):
    """Tokenize a text and return its stemmed tokens joined by spaces."""
    return " ".join(porter.stem(word) for word in word_tokenize(sentence))

# Stem every description instead of a hard-coded number of rows
data = [stemSentence(sentence) for sentence in data]
len(data)

In [None]:
data[:2]

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_features=1000,
                                   stop_words='english',
                                   encoding = 'utf-8',
                                   decode_error = 'replace')

In [None]:
# tfidf transformation
tfidf = tfidf_vectorizer.fit_transform(data)
tfidf.data[:50]

In [None]:
lda = LatentDirichletAllocation(n_components=10, max_iter=5, # n_components = number of topics to be found
                                learning_method='online',
                                learning_offset=50., 
                                random_state=0) # set seed
                                # doc_topic_prior (alpha) and topic_word_prior (beta)
                                # both default to 1 / n_components

lda.fit(tfidf)

lda.components_

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
print(tfidf_feature_names[:100])

## Print out top 20 words

In [None]:
for topic_idx, topic in enumerate(lda.components_):
    top_words = [tfidf_feature_names[i] for i in topic.argsort()[:-20-1:-1]] 
    print('Topic:',topic_idx,'--',top_words) 

## Evaluate based on perplexity

In [None]:
pscores = []
# n_topics = range(1, 31) #used for testing single outcomes. 5 steps are sufficient
for n_topic in [2, 3, 4, 5, 10, 15, 20, 30]:
    lda = LatentDirichletAllocation(n_components=n_topic, max_iter=5,random_state=7)

    lda.fit(tfidf)

    perplexity_score = lda.perplexity(tfidf)
    print(perplexity_score)
    pscores.append(perplexity_score)

# pscores
# lower perplexity is better

In [None]:
# plot the perplexity score against the number of topics
import matplotlib.pyplot as plt
plt.plot([2, 3, 4, 5, 10, 15, 20, 30],pscores,'r+--')
plt.xlabel('# of topics')
plt.ylabel('Perplexity score')
plt.show()

Hint: 1e245 = 10^245

**Interpretation: up to 20 topics are reasonable, but no more. We use 5 topics as a heuristic and try to determine the topic names.**
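To give a feeling for the scale of perplexity, the sketch below computes it directly from per-token probabilities as the exponential of the negative mean log-probability. The three "models" and their probabilities are made up for illustration. scikit-learn's `perplexity()` is based on an approximate bound, so its values are not directly comparable to this toy calculation.

```python
import math

def perplexity(token_probabilities):
    """exp of the negative mean log-probability of the observed tokens."""
    log_likelihood = sum(math.log(p) for p in token_probabilities)
    return math.exp(-log_likelihood / len(token_probabilities))

# Hypothetical per-token probabilities under three models
uniform_model = [1 / 1000] * 50   # guesses uniformly among 1000 words
better_model = [1 / 10] * 50      # narrows each token down to 10 words
perfect_model = [1.0] * 50        # always certain and correct

for name, probs in [("uniform", uniform_model),
                    ("better", better_model),
                    ("perfect", perfect_model)]:
    print(name, round(perplexity(probs), 2))
```

A model that is no better than guessing among V words has a perplexity of about V, and the lowest possible value is 1. Lower scores therefore indicate a better fit.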

In [None]:
lda_final = LatentDirichletAllocation(n_components=5, max_iter=5, 
                                learning_method='online', # 
                                learning_offset=50., 
                                random_state=0) # set seed

lda_final.fit(tfidf)

lda_final.components_

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
print(tfidf_feature_names[:100])

## Print out top 20 words
We print 20 instead of 10 words to make it easier to name the topics.

In [None]:
for topic_idx, topic in enumerate(lda_final.components_):
    top_words = [tfidf_feature_names[i] for i in topic.argsort()[:-20-1:-1]] 
    print('Topic:',topic_idx,'--',top_words) 

#### Interpretation of results

The results could correspond to the following topics:
- Topic 1 (index 0): IMDB Movie Reviews, e. g. article [IMDB Movie Reviews Dataset](https://www.kaggle.com/iarunava/imdb-movie-reviews-dataset)
- Topic 2 (index 1): Text Processing with Python, e. g. article [Machine Learning — Text Processing – Towards Data Science](https://towardsdatascience.com/machine-learning-text-processing-1d5a2d638958)
- Topic 3 (index 2): Pokemon behaviour, e. g. article [Pokemon Moves](https://www.poke-verse.com/pokemon-moves/)
- Topic 4 (index 3): Document Clustering, e. g. article [Self-Tuned Descriptive Document Clustering](http://pcwww.liv.ac.uk/~goulerma/publications/descr-clust_preprint_full.pdf)
- Topic 5 (index 4): Free Data Sets for Data Science Projects, e. g. article [Free Data Sets for Data Science Projects – Dataquest](https://www.dataquest.io/blog/free-datasets-for-projects/)

# Outlook

One could perform additional or alternative text preprocessing transformations such as
    
- lemmatization (as alternative to stemming)
- remove URLs
- remove punctuation
- remove pattern '\r\n', e. g. using regex

to check improvement on LDA outcome.
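As a starting point for the cleaning steps listed above, the following sketch removes URLs, the `\r\n` pattern and punctuation with the standard `re` module. The function `clean_description`, both patterns and the sample text are assumptions for illustration; the URL pattern in particular is deliberately simple and will not catch every link format.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+')      # simple pattern for http/https links
PUNCTUATION_PATTERN = re.compile(r'[^\w\s]')   # anything that is not a word character or whitespace

def clean_description(text):
    """Remove URLs, Windows line breaks and punctuation, then lowercase."""
    text = URL_PATTERN.sub(' ', text)
    text = text.replace('\r\n', ' ')
    text = PUNCTUATION_PATTERN.sub(' ', text)
    # Collapse repeated whitespace that the substitutions left behind
    return ' '.join(text.lower().split())

sample = "Movie reviews!\r\nSee https://example.com/data for details."
print(clean_description(sample))
```

Such a function could be applied to the 'Description' column with `.map(clean_description)` after dropping missing values, before stemming or lemmatization.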

# Submission Format

Please submit one Python Jupyter notebook file named as follows: YourFirstName_YourLastName.ipynb