Libraries to pip install 

!pip install matplotlib
!pip install spacy
!pip install pandas
!pip install seaborn
!pip install nltk
!pip install scikit-learn
!pip install gensim
!pip install wordcloud
!pip install pyLDAvis
!pip install numpy==1.26.4

import nltk
nltk.download('all')

# **Pre-Processing data**
## *Libraries to import*
```Python
import string #punctuation list
from nltk.corpus import stopwords #stopwords
from nltk.stem import PorterStemmer #stemming
from nltk.stem.wordnet import WordNetLemmatizer #lemmatization

porter_stemmer = PorterStemmer() #stemming
lemma = WordNetLemmatizer() #lemmatization
stopwordss = stopwords.words('english') #list of stopwords
exclude = set(string.punctuation) #punctuations
```

## *Cleaning*
```Python
def clean(doc):
    punc_free = ''.join([ch for ch in doc.lower() if ch not in exclude]) #remove punctuations
    stop_free = ' '.join([i for i in punc_free.split() if i not in stopwordss]) #remove stopwords
    normalized = ' '.join(lemma.lemmatize(word) for word in stop_free.split()) #lemmatisation
    stemmed = ' '.join(porter_stemmer.stem(word) for word in normalized.split()) #stemming
    return normalized #returns the lemmatised text; return stemmed instead to use the stemmed version

hotel_reviews["preprocessed_review"] = hotel_reviews['Review'].apply(lambda x : clean(x).split()) #pandas dataframe
doc_clean = [clean(doc).split() for doc in corpus] #list of documents given
```
Adding domain-specific stopwords
```Python
#addon to stop words
domain_stop = ["said", "mr"]
stopwordss.extend(domain_stop) #stopwordss is a list, so use extend (update only exists on sets)
```

# **Visualising Text Data (EDA)**
## *Frequency of Words: Frequency Distribution / WordCloud*
```Python
from nltk.probability import FreqDist
import matplotlib.pyplot as plt
from wordcloud import WordCloud
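reviews = hotel_reviews['Review'].tolist() #convert dataframe column to list
reviews_cleaned = [word for review in reviews for word in clean(review).split()] #flatten the cleaned reviews into one list of words
freq_dist = FreqDist(reviews_cleaned) #calculate frequency of words; the input should be a list of words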
freq_dist.plot(50, cumulative=False) #generate freq distribution plot 
cloud=WordCloud().generate_from_frequencies(freq_dist) #generate word cloud
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```


## *Calculating word frequency from a list of texts: tokenize the words before doing anything else*
```Python
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
all_words = [word.lower() for sent in df.text for word in word_tokenize(sent)]
all_words_frequency = FreqDist(all_words)
```


## *Distribution of Sentiments*
```Python
#group ratings
import pandas as pd
import matplotlib.pyplot as plt
plt.figure()
hotel_reviews['Rating'].value_counts().plot.bar(title="Rating Distribution") #pd.value_counts is deprecated in recent pandas
plt.xlabel("Rating")
plt.ylabel("Number of rows")
plt.show()
```


## *POS and NER Tagging*
- Tokenise the words first, then use the POS tagger in the NLTK library to get the word tags
- **Tokenize, then POS tag**
```Python
import nltk

def tagPOS(text):
    wordsList = nltk.word_tokenize(text)  #word tokenizer splits the text into words
    tagged = nltk.pos_tag(wordsList)      #part-of-speech (POS) tagger labels each word
    return tagged
df['POS_News'] = df['Text'].apply(tagPOS)  #apply tagPOS to each row of text
df.head()
```

## *Finding specific tag from tag text*
**find the top N words based on the POS tag**  
```Python
def findtags(tag_prefix, tagged_text, n):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(n)) for tag in cfd.conditions())
```
**find the top 5 adjectives in the first news article**
```Python
tagged_text = df['POS_News'][0]
tagdict = findtags('JJ', tagged_text, 5)
for tag in sorted(tagdict):
    print(tag, tagdict[tag])
```


## *Named Entity Recognition (NER): identifies named entities*
```Python
from nltk import ne_chunk
nltk.download('words')
nltk.download('maxent_ne_chunker')
res_chunk = ne_chunk(tagged_text) #NER chunking
for x in str(res_chunk).split('\n'):
    if '/NN' in x:
        print(x) #print tags with Noun NN
```

# **Sentiment Classification**

## *Feature Extraction*
- use top-N words feature
### *Fetching words from corpus*
```Python
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
all_words = [word.lower() for sent in df.text for word in word_tokenize(sent)]
# print first 10 words
print (all_words[:10])
```

### *Create frequency distribution of words: calculate occurrences of each word in entire list of words*
```Python
from nltk import FreqDist
all_words_frequency = FreqDist(all_words)
print (all_words_frequency)
# print 10 most frequently occurring words
print (all_words_frequency.most_common(10))
```



## *Create Word Feature using 2000 most frequently occurring words*
We take the 2000 most frequently occurring words as our features.
```Python
print (len(all_words_frequency)) 
 
# get the 2000 most frequently occurring words
most_common_words = all_words_frequency.most_common(2000) #nltk.FreqDist.most_common() returns (word, count) tuples

# each element of most_common_words is a tuple,
# so keep only the first element (the word) of each tuple
word_features = [item[0] for item in most_common_words]
print (word_features[:10])
```

## *Create Feature Set*
- apply the text preprocessing to each review
```Python
df['text'] = df['text'].apply(lambda x: clean(x).split())  #clean() expects a string and lowercases it; split the result into tokens
df.head()
```
- create the feature set to train the classifier: check whether each word in `word_features` is present in the given document
```Python
def document_features(df, stemmed_tokens):
    doc_features = []
    for index, row in df.iterrows():
        features = { }
        for word in word_features:
            # get term occurrence: True if the feature word is in the document, False if it's not
            features[word] = (word in row[stemmed_tokens])
        doc_features.append(features)
    return doc_features
feature_set = pd.DataFrame(document_features(df, 'text'), index = df.index)
feature_set.head()
```
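
As a rough illustration of the top-N presence features, the same idea can be sketched with the standard library only. The toy documents and the names `toy_docs`, `toy_word_features` and `presence_features` are assumptions made for this example; the notes above use NLTK's `FreqDist` and a pandas DataFrame instead.

```python
from collections import Counter

# Toy tokenised documents (hypothetical data)
toy_docs = [
    ["great", "room", "great", "view"],
    ["bad", "room", "noisy"],
    ["great", "staff"],
]

# Count every word across the corpus and keep the top N as features
top_n = 2
word_counts = Counter(word for doc in toy_docs for word in doc)
toy_word_features = [word for word, _ in word_counts.most_common(top_n)]

def presence_features(tokens, feature_words):
    """Map each feature word to True/False depending on whether the document contains it."""
    token_set = set(tokens)  # set lookup avoids rescanning the token list for every feature
    return {word: word in token_set for word in feature_words}

feature_rows = [presence_features(doc, toy_word_features) for doc in toy_docs]
for row in feature_rows:
    print(row)
```

Each row has one boolean per feature word, which is the shape the classifier is trained on.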


## *Training Classifier*
*Creating Training and Test Set*
```Python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

X = feature_set
y = df[df.columns[-1:]] #label column; assumes the last column of df is 'sentiment'

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print (y_train.sentiment.value_counts(normalize=True))

#plot chart
plt.style.use('ggplot')
plt.figure(figsize=(6,4))
sns.countplot(data=y_train, x='sentiment')
```

*Training Classifier*
```Python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
#use decision tree in this case
classifier = DecisionTreeClassifier(random_state=42)
classifier.fit(X_train, y_train)

# predict once and reuse the predictions
y_pred = classifier.predict(X_test)

# classification report
print(classification_report(y_test, y_pred))

# accuracy score
print("Accuracy Score: " + str(accuracy_score(y_test, y_pred)))
```
*Function to plot a confusion matrix*
```Python
def conf_matrix(y_test, pred_test):    
    # Creating a confusion matrix
    con_mat = confusion_matrix(y_test, pred_test)
    con_mat = pd.DataFrame(con_mat, range(2), range(2))
    #Plotting the confusion matrix
    plt.figure(figsize=(6,6))
    sns.set(font_scale=1.5) 
    sns.heatmap(con_mat, annot=True, annot_kws={"size": 16}, fmt='g', cmap='Blues', cbar=False)
    plt.show()
    
#Implementing the confusion matrix
conf_matrix(y_test, y_pred)
```
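
To read the heatmap, it helps to see how the four cells relate to the scores. The sketch below counts them by hand with plain Python; the label lists are made-up illustration data and `binary_confusion_counts` is a hypothetical helper. For binary labels 0 and 1, scikit-learn's `confusion_matrix` lays the cells out as `[[tn, fp], [fn, tp]]` (rows are true labels, columns are predictions).

```python
def binary_confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive and actual != positive:
            fp += 1
        elif predicted != positive and actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Made-up labels for illustration only
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = binary_confusion_counts(y_true_demo, y_pred_demo)
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print("matrix:", [[tn, fp], [fn, tp]])
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```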
*Testing classifier with custom reviews*
```Python
# Negative review correctly classified as negative 
# Positive review is classified as negative
data = {'custom_review': ['I hated the film. It was a disaster. Poor direction, bad acting.', 
                          'It was a wonderful and amazing movie. I loved it. Best direction, good acting.']}

df_test = pd.DataFrame(data, columns = ['custom_review'])
df_test['custom_review'] = df_test['custom_review'].apply(lambda x: clean(x).split()) #clean() expects a string; split the result into tokens

test_features = pd.DataFrame(document_features(df_test, 'custom_review'), index = df_test.index)
print (classifier.predict(test_features))
```


## *Bag of Words using TF-IDF Feature*
- create a dictionary of unique words and calculate term weights for the text features.  
*Creating Bag of Words features*
```Python
import gensim
from gensim import corpora
# Build the dictionary
mydict = corpora.Dictionary(df['text'])
vocab_len = len(mydict)
def get_bow_features(df, stemmed_tokens): #create term frequency features
    test_features = []
    for index, row in df.iterrows():
        # Converting the tokens into the format that the model requires
        features = gensim.matutils.corpus2csc([mydict.doc2bow(row[stemmed_tokens])],num_terms=vocab_len).toarray()[:,0]
        test_features.append(features)
    return test_features

header = [mydict[ele] for ele in range(vocab_len)] #column names: one token per dictionary id

bow_features = pd.DataFrame(get_bow_features(df, 'text'), columns=header, index = df.index)
bow_features.head()

#CREATE TERM WEIGHTS WITH TF-IDF
from gensim.models import TfidfModel

# reuse mydict and vocab_len built above
corpus = [mydict.doc2bow(line) for line in df['text']]
tfidf_model = TfidfModel(corpus)

def get_tfidf_features(df, stemmed_tokens):
    test_features_tfidf = []
    for index, row in df.iterrows():
        doc = mydict.doc2bow(row[stemmed_tokens])
        # Converting the tokens into the format that the model requires
        features = gensim.matutils.corpus2csc([tfidf_model[doc]], num_terms=vocab_len).toarray()[:,0]
        test_features_tfidf.append(features)
    return test_features_tfidf

header = [mydict[ele] for ele in range(vocab_len)] #column names: one token per dictionary id

tfidf_features = pd.DataFrame(get_tfidf_features(df, 'text'),
                            columns=header, index = df.index)
tfidf_features.head()
```
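
To show what a TF-IDF weight is, here is a hand-rolled sketch using only the standard library. It assumes raw term frequency, a base-2 logarithm for the inverse document frequency, and scaling each document vector to unit length; this is believed to mirror gensim's `TfidfModel` defaults, but treat that as an assumption and check the gensim documentation. The toy corpus and the names `toy_docs` and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# Toy tokenised corpus (hypothetical data)
toy_docs = [
    ["good", "room"],
    ["good", "view"],
    ["good", "room", "room"],
]

num_docs = len(toy_docs)
# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in toy_docs for term in set(doc))
# Inverse document frequency with a base-2 logarithm
idf = {term: math.log2(num_docs / df) for term, df in doc_freq.items()}

def tfidf_vector(doc):
    """Return {term: weight} using tf * idf, scaled to unit (L2) length."""
    term_freq = Counter(doc)
    weights = {term: count * idf[term] for term, count in term_freq.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights  # every term appears in every document
    return {term: w / norm for term, w in weights.items()}

toy_vectors = [tfidf_vector(doc) for doc in toy_docs]
for doc, vec in zip(toy_docs, toy_vectors):
    print(doc, vec)
```

A word such as "good" that appears in every document gets a weight of 0, which is why TF-IDF pushes down uninformative terms that a plain bag of words would count heavily.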

## *Training Classifier + Accuracy calculation with TFIDF Feature Set*
```Python
X = tfidf_features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
#using decision tree
classifier = DecisionTreeClassifier(random_state=42)
classifier.fit(X_train, y_train)
# predict once and reuse the predictions
y_pred = classifier.predict(X_test)
# classification report
print(classification_report(y_test, y_pred))
# accuracy score
print("Accuracy Score: " + str(accuracy_score(y_test, y_pred)))
```


## *Classifiers*
**Decision Tree** 
```Python 
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(random_state=42)
classifier.fit(X_train, y_train)
```

**Logistic Regression**
```Python
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train, y_train)
```
**Naive Bayes**
```Python
from sklearn.naive_bayes import GaussianNB #Gaussian Naive Bayes 
clf = GaussianNB()
clf.fit(X_train, y_train)
```

## *Checking importance of features among entire features in feature set*
```Python
#Find out the most important features from the classification model
importances = list(classifier.feature_importances_)
feature_importances = [(feature, round(importance, 10)) for feature, importance in zip(tfidf_features.columns, importances)]
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

#print the 20 most important features
for pair in feature_importances[:20]:
    print('Variable: {:10} Importance: {}'.format(*pair))
```

# **Topic Modelling**
## *Import Libraries*
```Python
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

import gensim #for gensim LDA model
from gensim import corpora
import string
from pathlib import Path
```
<<!Pre-Processing Here>>  


## *Preparing word representation*
- use gensim library to do term frequency word representation
```Python
dictionary = corpora.Dictionary(doc_clean) #use gensim corpora to build a dictionary of all unique words
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean] #convert each doc / file into a bag-of-words vector to form the doc-term matrix
```


## *Creating LDA model*
- use gensim's LDA model, setting the number of topics to 5 for the first model
```Python
from pprint import pprint

topic_num = 5 #no. of topics is 5
word_num = 5 #no. of words shown per topic is 5

Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(doc_term_matrix, num_topics = topic_num, id2word = dictionary, passes=20)

pprint(ldamodel.print_topics(num_topics=topic_num, num_words=word_num))
# Compute Perplexity
# log_perplexity returns the per-word likelihood bound; perplexity = 2 ** (-bound)
print('Perplexity: ', ldamodel.log_perplexity(doc_term_matrix)) #need to write perplexity, important for practical test
```

Some pointers: 
- The resulting topics are often difficult to map to a category and may not be meaningful.
- How do you determine a suitable number of topics?
    - Use the perplexity value, a statistical measure of how well a probability model predicts a sample. Its benefit comes when comparing different LDA models: the model with the lower perplexity value is considered better.
    - The perplexity value changes with topic_num (word_num only controls how many words are printed per topic). What is a good enough perplexity value? (Remember to plot it.)
- **Increasing topic_num** to a large number **MAY NOT HELP** in understanding the categories (unless there is prior knowledge that a large value is plausible), and it sacrifices clarity.
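
A small sketch of how the comparison could be done, assuming the numbers come from gensim's `log_perplexity`, which returns a per-word likelihood bound rather than the perplexity itself (perplexity is `2 ** (-bound)`). The bound values below are invented for illustration, and `perplexity_from_bound` is a hypothetical helper.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound (base 2) into a perplexity value."""
    return 2 ** (-per_word_bound)

# Invented bounds for three candidate topic counts (illustration only)
example_bounds = {5: -7.9, 10: -7.6, 15: -7.8}

perplexities = {k: perplexity_from_bound(b) for k, b in example_bounds.items()}
best_k = min(perplexities, key=perplexities.get)  # lower perplexity is better

for k, value in perplexities.items():
    print(f"num_topics={k}: perplexity={value:.1f}")
print("lowest perplexity at num_topics =", best_k)
```

Note the sign: a bound closer to zero corresponds to a lower perplexity, so when comparing the raw `log_perplexity` outputs the higher (less negative) value is the better one.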




## *Retrieving topic details*
- Part 1: find each file name and its corresponding topic ids with probabilities. Because LDA models each document as a probabilistic mixture of topics, it assigns topic ids with probabilities, indicating that a document can belong to more than one topic. FILES TO TOPIC.
```Python
print('\nFile name and its corresponding topic id with probability:')
dic_topic_doc = {}
for index, doc in enumerate(doc_clean):
    bow = dictionary.doc2bow(doc)
    #get topic distribution of the ldamodel
    t = ldamodel.get_document_topics(bow)
    #sort the probability value in descending order to extract the top contributing topic id
    sorted_t = sorted(t, key=lambda x: x[1], reverse=True)
    #print the filename and its topic distribution
    print(filenames[index],sorted_t)
    #get the top scoring item
    top_item = sorted_t.pop(0)
    #create dictionary and keep key as topic id and filename and probability in tuple as value
    dic_topic_doc.setdefault(top_item[0],[]).append((filenames[index],top_item[1]))
```

- Part 2: use the information above to extract, for each topic id, the number of files belonging to the topic and the list of file names with probabilities. TOPICS TO FILES. (Both approaches work.)
```Python 
#print out each identified topic id and its associated documents
print('\nTopic id, number of documents, list of documents with probability and represented topic words:')

for key,value in dic_topic_doc.items():
    sorted_value = sorted(value, key=lambda x: x[1], reverse=True)
    print(key,len(value),sorted_value)
    #print the topic word and most represented doc
    print(ldamodel.print_topic(key,word_num))
```
The interpretation of the result, based on the output below:
> 0 13 [('206.txt', 0.99757373), ('112.txt', 0.99581325), ('221 .txt', 0.99573374) …  
> 0.005*"said" + 0.005*"network" + 0.005*"business" + 0.004*"uk" + 0.004*"could"

This means that topic id 0 has 13 files assigned to it; 206.txt has the highest probability, followed by 112.txt and so on. Python indexing starts at 0, so topic id 0 is the first topic identified.
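
The files-to-topic and topics-to-files steps can be sketched without a trained model. The topic distributions below are invented, in the same `(topic_id, probability)` shape that `get_document_topics` returns; the file names and the variable names `toy_topics` and `topic_to_docs` are hypothetical.

```python
# Invented per-document topic distributions: {file name: [(topic_id, probability), ...]}
toy_topics = {
    "a.txt": [(0, 0.70), (1, 0.30)],
    "b.txt": [(1, 0.90), (0, 0.10)],
    "c.txt": [(0, 0.95), (1, 0.05)],
}

# Files to topic: keep only the highest-probability topic for each file
topic_to_docs = {}
for name, distribution in toy_topics.items():
    top_id, top_prob = max(distribution, key=lambda pair: pair[1])
    topic_to_docs.setdefault(top_id, []).append((name, top_prob))

# Topics to files: sort each topic's files by probability, highest first
for topic_id in topic_to_docs:
    topic_to_docs[topic_id].sort(key=lambda pair: pair[1], reverse=True)

for topic_id, docs in sorted(topic_to_docs.items()):
    print(topic_id, len(docs), docs)
```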



## *Visualize Topics and Keywords*
- Use pyLDAvis to visualise the fit of the model across topics and their top words.
- Topics are plotted using PCA (Principal Component Analysis) with the doc-term matrix; PCA provides the dimension reduction.
- Overlapping topics imply that the topics share similar words and are therefore not distinct, so choose good alpha and beta values. To improve this, increase the number of topics; more topics means more cluster bubbles.
    - Challenge: a larger number of topics may be more distinct, but similar words can still appear across topics. Issues with more topic clusters in LDA: 
- Lower perplexity is better, but down to what value? 

> pip install pyLDAvis

```Python
# plotting tools
import pyLDAvis
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt
%matplotlib inline

# visualize the topics and keywords
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(ldamodel, doc_term_matrix, dictionary)
vis
```

## *Plotting Perplexity against # of Topics* 
```Python
# setting ranges
topic_range = range(2, 21)
perplexity_values = []
dictionary = corpora.Dictionary(corpus_text) #use gensim corpora to build a dictionary of all unique words
doc_term_matrix = [dictionary.doc2bow(doc) for doc in corpus_text] #convert each doc / file into a bag-of-words vector to form the doc-term matrix

#Train models and calculate perplexity
for num_topics in topic_range:
    Lda = gensim.models.ldamodel.LdaModel
    ldamodel = Lda(doc_term_matrix, num_topics = num_topics, id2word = dictionary, passes=20) #use the loop variable, not the fixed topic_num
    perplexity = ldamodel.log_perplexity(doc_term_matrix)
    perplexity_values.append(perplexity)
    print(f'num_topics: {num_topics}, perplexity: {perplexity}')

# Plotting perplexity
plt.figure(figsize=(10, 6))
plt.plot(topic_range, perplexity_values, marker='o')
plt.title('Perplexity vs Number of Topics')
plt.xlabel('Number of Topics')
plt.ylabel('Perplexity')
plt.grid()
plt.show()
```