## Student Name: Vamsi Thokala
## Student Email: Vamsi.thokala-1@ou.edu

# Project 3: The Smart City Slicker

Imagine you are a stakeholder in a rising Smart City and want to learn more about the themes and concepts of existing smart cities. You also want to know where your smart city ranks among the others. In this project, you will perform exploratory data analysis, often shortened to EDA, on data from the [2015 Smart City Challenge](https://www.transportation.gov/smartcity) to find facts about the data and communicate those facts through text analysis and visualizations.

In order to explore the data and visualize it, some modifications might need to be made to the data along the way. This is often referred to as data preprocessing or cleaning.
Though data preprocessing is technically different from EDA, EDA often exposes problems with the data that need to be fixed in order to continue exploring.
Because of this tight coupling, you have to clean the data as necessary to help understand the data.

In this project, you will apply your knowledge about data cleaning, machine learning, visualizations, and databases to explore smart city applications.

**Part 1** of the notebook will explore and clean the data. \
**Part 2** will take the results of the preprocessed data to create models and visualizations.

Empty cells are code cells. 
Cells denoted with [Your Answer Here] are markdown cells.
Edit and add as many cells as needed.

Output file for this notebook is shown as a table for display purposes. Note: The city name can be Norman, OK or OK Norman.

| city | raw text | clean text | clusterid | topicids | summary | keywords|
| -- | -- | -- | -- | -- | -- | -- |
|Norman, OK | Test, test , and testing. | test test test | 0 | T1, T2| test | test |

## Introduction
The Dataset: 2015 Smart City Challenge Applicants (non-finalist).
In this project you will use the applicant's PDFs as a dataset.
The dataset is from the U.S Department of Transportation Smart City Challenge.

On the website page for the data, you can find some basic information about the challenge. This is an interesting dataset. Think of the questions that you might be able to answer! A few could be:

1. Can I identify frequently occurring words that could be removed during data preprocessing?
2. Where are the applicants from?
3. Are there multiple entries for the same city in different applications?
4. What are the major themes and concepts from the smart city applicants?
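
Question 3 can be checked from the file names alone. The sketch below assumes that repeated applications are marked with a trailing `_<number>` suffix, as in the `DC_0` and `CA San Jose_0` entries that appear later in this notebook. The helper name `group_by_city` and the `DC_1.pdf` file are hypothetical and used only for illustration.

```python
import re
from collections import defaultdict

def group_by_city(filenames):
    """Group file names that differ only by a trailing _<number> suffix."""
    groups = defaultdict(list)
    for name in filenames:
        stem = name.rsplit(".", 1)[0]        # drop the file extension
        base = re.sub(r"_\d+$", "", stem)    # "DC_0" -> "DC"
        groups[base].append(name)
    return dict(groups)

# "DC_1.pdf" is a made-up file name for this example
files = ["DC_0.pdf", "DC_1.pdf", "CA San Jose_0.pdf", "VA Norfolk.pdf"]
duplicates = {city: names for city, names in group_by_city(files).items() if len(names) > 1}
print(duplicates)
```

Any city that maps to more than one file is a candidate for merging or for keeping only one application.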

Let's load the data!

## Loading and Handling files

Load data from `smartcity/`. 

To extract the data from the PDF files, use the [pypdf.PdfReader](https://pypdf.readthedocs.io/en/stable/index.html) class.
It will allow you to extract pages and pdf files and add them to a data structure (dataframe, list, dictionary, etc).
To install the module, use the command `pipenv install pypdf`.
You only need to handle PDF files, handling docx is not necessary.

## Cleaning Up PDFs

One of the more frustrating aspects of PDF is loading the data into a readable format. The first order of business will be to preprocess the data. To start, you can use code provided by Text Analytics with Python, [Chapter 3](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch03%20-%20Processing%20and%20Understanding%20Text/Ch03a%20-%20Text%20Wrangling.ipynb): [contractions.py](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch05%20-%20Text%20Classification/contractions.py) (Pages 136-137), and [text_normalizer.py](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch05%20-%20Text%20Classification/text_normalizer.py) (Pages 155-156). Feel free to download the scripts or add the code directly to the notebook (please note this code is performed on dataframes).

In addition to the data cleaning provided by the textbook, you will need to:
1. Consider removing terms that may affect clustering and topic modeling. Words to consider are cities, states, and common words (smart, city, page, etc.). Keep in mind that n-gram combinations are important; this can also be revisited later depending on your model's performance.
2. Check the data and remove applicants whose text was not processed correctly. Do not remove more than 15 cities from the data.
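
Because n-gram combinations matter, multi-word phrases are best removed before single words; otherwise removing "city" first would leave a stray "oklahoma" behind. The following is a minimal sketch using only the standard library. The function name `remove_terms` and the phrase lists are assumptions for illustration, not part of the textbook code.

```python
import re

def remove_terms(text, stop_phrases, stop_words):
    """Drop multi-word phrases first, then single stop words."""
    text = text.lower()
    # Longest phrases first, so a shorter phrase cannot break up a longer one
    for phrase in sorted(stop_phrases, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", text)
    tokens = [t for t in re.findall(r"[a-z]+", text) if t not in stop_words]
    return " ".join(tokens)

cleaned = remove_terms(
    "The Smart City Challenge in Oklahoma City improves transit",
    stop_phrases=["smart city challenge", "oklahoma city"],
    stop_words={"the", "in", "city", "smart", "page"},
)
print(cleaned)
```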


### Create a data structure to add the city name and raw text. You can choose to split the city name from the file.

#### Add the cleaned text to the structure you created.

In [None]:
%pip install PyPDF2
%pip install nltk
%pip install matplotlib
%pip install tabulate
%pip install joblib
%pip install gensim


In [1]:

import PyPDF2
import os
import pandas as pd
import numpy as np
import nltk
import joblib
from tabulate import tabulate
from PyPDF2 import PdfReader
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.sparse import csr_matrix
import warnings
warnings.filterwarnings("ignore")
import logging
logging.getLogger("PyPDF2").setLevel(logging.ERROR)


In [2]:
nltk.download("stopwords")
nltk.download("wordnet")


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vamsithokala/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/vamsithokala/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:

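def read_pdf(file_path):
    # Open the file in a context manager so the handle is always closed
    with open(file_path, "rb") as fh:
        pdf = PdfReader(fh)
        # Guard against pages for which no text is returned
        text = " ".join(page.extract_text() or "" for page in pdf.pages)
    return text

def preprocess_text(text, extra_stopwords):
    # Build the stopword set once instead of rebuilding the list for every token
    stop_set = set(stopwords.words("english")) | set(extra_stopwords)
    lemmatizer = WordNetLemmatizer()
    text = text.lower()
    words = nltk.word_tokenize(text)
    # Keep alphabetic tokens only, then drop stopwords
    words = [word for word in words if word.isalpha() and word not in stop_set]
    # Reduce each remaining word to its lemma
    words = [lemmatizer.lemmatize(word) for word in words]
    return " ".join(words)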


# Add your extra stopwords here
extra_stopwords = ["city", "state", "smart", "page"]

data = []
min_text_length = 20  # Set a threshold for minimum text length
removed_cities_count = 0
removed_cities = []  # To store the removed cities and their issues
directory = "smartcity/"
for filename in os.listdir(directory):
    if filename.endswith(".pdf"):
        city = filename.split(".")[0]
        raw_text = read_pdf(os.path.join(directory, filename))
        clean_text = preprocess_text(raw_text, extra_stopwords)

        # Check if the cleaned text is above the threshold and limit the number of removed cities
        if len(clean_text) > min_text_length or removed_cities_count >= 15:
            data.append([city, raw_text, clean_text])
        else:
            issue = f"Text length ({len(clean_text)}) is below the threshold ({min_text_length})"
            removed_cities.append((city, issue))
            removed_cities_count += 1

# Display the removed cities and their issues
for city, issue in removed_cities:
    print(f"Removed city: {city}, Issue: {issue}")

df = pd.DataFrame(data, columns=["city", "raw text", "clean text"])







Removed city: OH Toledo, Issue: Text length (0) is below the threshold (20)
Removed city: CA Moreno Valley, Issue: Text length (0) is below the threshold (20)
Removed city: TX Lubbock, Issue: Text length (0) is below the threshold (20)
Removed city: NV Reno, Issue: Text length (0) is below the threshold (20)
Removed city: FL Tallahassee, Issue: Text length (0) is below the threshold (20)


### Clean Up: Discussion
Answer the questions below.

#### Which Smart City applicants did you remove? What issues did you see with the documents?

The following five applicants were removed because their cleaned text was empty (length 0, below the threshold of 20 characters):

1. OH Toledo
2. CA Moreno Valley
3. TX Lubbock
4. NV Reno
5. FL Tallahassee

A length of 0 means that no usable text was extracted from these PDFs, so the text extraction most likely failed for these specific documents (possibly because the pages are scanned images without a text layer).
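
The removal rule can be sketched as a small helper that works on text lengths only. Unlike the loop above, which drops the first short documents it meets, this variant drops the shortest documents first and never more than 15. The function name and the length given for `VA Norfolk` are hypothetical.

```python
def flag_failed_extractions(lengths, min_length=20, max_removed=15):
    """Return the cities to drop: cleaned text no longer than min_length,
    shortest first, and never more than max_removed cities."""
    too_short = sorted((n, city) for city, n in lengths.items() if n <= min_length)
    return [city for n, city in too_short[:max_removed]]

# The length for "VA Norfolk" is made up for this example
lengths = {"OH Toledo": 0, "NV Reno": 0, "VA Norfolk": 54000}
print(flag_failed_extractions(lengths))
```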

#### Explain what additional text processing methods you used and why.

1. The `preprocess_text()` function takes the raw text and a list of extra stopwords as arguments.
2. The text is converted to lowercase to maintain uniformity and to match words in the stopwords list.
3. The text is tokenized into words using the `nltk.word_tokenize()` function.
4. Only alphabetic words are retained, removing any numbers or special characters.
5. Stopwords, including the extra stopwords provided, are removed from the list of words.
6. Word lemmatization is performed using the `WordNetLemmatizer()` class from NLTK, which reduces words to their base form (lemma), improving the clustering process by grouping similar words together.


#### Did you identify any potentially problematic words?

["city", "state", "smart", "page"]. These words are considered potentially problematic because they are common words that may appear frequently in the documents but do not provide any valuable information for clustering or topic modeling. Removing these words from the text helps to focus on more meaningful words and n-grams that better represent the content of the documents.

## Experimenting with Clustering Models

Now, you'll start to explore models to find the optimal clustering model. In this section, you'll explore [K-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), [Hierarchical](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html), and [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN) clustering algorithms.
Create these algorithms with k_clusters for K-means and Hierarchical.
For each cell in the table provide the [Silhouette score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score), [Calinski and Harabasz score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html#sklearn.metrics.calinski_harabasz_score), and [Davies-Bouldin score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html#sklearn.metrics.davies_bouldin_score).

In each cell, create an array to store the values.
For example, 

|Algorithm| k = 9 | k = 18| k = 36 | Optimal k| 
|--|--|--|--|--|
|K-means| [S,CH,DB]| [S,CH,DB] | [S,CH,DB] | [S,CH,DB] |
|Hierarchical |[S,CH,DB]| [S,CH,DB]| [S,CH,DB] | [S,CH,DB]|
|DBSCAN | X | X | X | [S,CH,DB] |



### Optimality 
You will need to find the optimal k for K-means and Hierarchical algorithms.
Find the optimality for k in the range 2 to 50.
Provide the code used to generate the optimal k and provide justification for your approach.


|Algorithm| k = 9 | k = 18| k = 36 | Optimal k| 
|--|--|--|--|--|
|K-means|--|--|--|--|
|Hierarchical |--|--|--|--|
|DBSCAN | X | X | X | -- |
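
DBSCAN has no k, which is why its row has a single entry. scikit-learn's DBSCAN labels noise points as `-1`, so it helps to count real clusters and noise points separately before computing any scores. The helper below is a sketch; its name and the rule "at least two real clusters" are assumptions, not requirements of the project.

```python
def summarize_dbscan_labels(labels):
    """Count real clusters and noise points in a DBSCAN label list."""
    clusters = set(labels) - {-1}                 # -1 marks noise, not a cluster
    n_noise = sum(1 for label in labels if label == -1)
    return {
        "n_clusters": len(clusters),
        "n_noise": n_noise,
        # Assumed rule: only score the result when two or more real clusters exist
        "scorable": len(clusters) >= 2,
    }

print(summarize_dbscan_labels([-1, -1, 0, 0, 0]))
print(summarize_dbscan_labels([0, 0, 1, 1, -1]))
```

If most points are noise, tuning `eps` and `min_samples` is usually more informative than reporting scores for a single cluster plus noise.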



In [4]:

# Prepare the data
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["clean text"])
X_dense = X.toarray()

# Initialize variables
k_values = list(range(2, 51))
results = {
    "K-means": {},
    "Hierarchical": {},
    "DBSCAN": {},
}

optimal_k = {
    "K-means": {"k": 0, "silhouette_score": -1},
    "Hierarchical": {"k": 0, "silhouette_score": -1},
}

# Evaluate clustering models
for k in k_values:
    # K-means clustering
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans_labels = kmeans.fit_predict(X)
    kmeans_silhouette = silhouette_score(X, kmeans_labels)
    kmeans_calinski_harabasz = calinski_harabasz_score(X_dense, kmeans_labels)
    kmeans_davies_bouldin = davies_bouldin_score(X_dense, kmeans_labels)
    results["K-means"][k] = [kmeans_silhouette, kmeans_calinski_harabasz, kmeans_davies_bouldin]

    if kmeans_silhouette > optimal_k["K-means"]["silhouette_score"]:
        optimal_k["K-means"]["k"] = k
        optimal_k["K-means"]["silhouette_score"] = kmeans_silhouette

    # Hierarchical clustering
    hierarchical = AgglomerativeClustering(n_clusters=k)
    hierarchical_labels = hierarchical.fit_predict(X_dense)
    hierarchical_silhouette = silhouette_score(X, hierarchical_labels)
    hierarchical_calinski_harabasz = calinski_harabasz_score(X_dense, hierarchical_labels)
    hierarchical_davies_bouldin = davies_bouldin_score(X_dense, hierarchical_labels)
    results["Hierarchical"][k] = [hierarchical_silhouette, hierarchical_calinski_harabasz, hierarchical_davies_bouldin]

    if hierarchical_silhouette > optimal_k["Hierarchical"]["silhouette_score"]:
        optimal_k["Hierarchical"]["k"] = k
        optimal_k["Hierarchical"]["silhouette_score"] = hierarchical_silhouette

# DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)
unique_labels = np.unique(dbscan_labels)

if len(unique_labels) > 1:
    dbscan_silhouette = silhouette_score(X, dbscan_labels)
    dbscan_calinski_harabasz = calinski_harabasz_score(X_dense, dbscan_labels)
    dbscan_davies_bouldin = davies_bouldin_score(X_dense, dbscan_labels)
    results["DBSCAN"] = {"X": [dbscan_silhouette, dbscan_calinski_harabasz, dbscan_davies_bouldin]}
else:
    results["DBSCAN"] = {"X": ["Not enough clusters", "Not enough clusters", "Not enough clusters"]}


print(results)



{'K-means': {2: [0.02770128465494533, 1.3158572530700587, 2.6636799594572254], 3: [0.01121621185016327, 1.2902488017744327, 4.3996535250427895], 4: [0.022162328255323067, 1.281477560379151, 1.744814589041059], 5: [0.02446750543661035, 1.283371997436961, 1.8026921615592957], 6: [-0.012313899118486302, 1.201426384937001, 1.1958985348125328], 7: [0.012274128628564094, 1.2226321056952096, 3.1172460205855788], 8: [0.011398520592593941, 1.208362992796128, 2.2499310096849805], 9: [0.0023772027586678726, 1.197555593073037, 2.698205182298869], 10: [0.008167119588534185, 1.2415145152388947, 1.9030219267197233], 11: [0.016183491993321378, 1.1980339529979003, 1.918036555761922], 12: [0.006970353078722772, 1.2055996420430035, 1.6782926867109855], 13: [-0.011306603213943726, 1.191558639123327, 1.3215828868016126], 14: [-0.0024697044251542556, 1.2053905594266479, 2.161152066837832], 15: [-0.0005339231983738511, 1.1747747092204457, 1.2884685877567226], 16: [-0.004127668833803753, 1.1788036718029997, 1

In [5]:

def print_results_table(results, optimal_k):
    header = ["Algorithm"]
    k_values = sorted(list(results["K-means"].keys()))
    header.extend([f"k = {k}" for k in k_values])
    header.append("Optimal k")
    rows = []

    for algorithm in results.keys():
        if algorithm == "DBSCAN":
            row = [algorithm]
            for k in k_values:
                row.append("X")
            row.append("--")
        else:
            row = [algorithm]
            for k in k_values:
                # Use a local name that does not shadow sklearn's silhouette_score
                silhouette = results[algorithm][k][0]
                row.append(round(silhouette, 4))
            row.append(optimal_k[algorithm]["k"])
        rows.append(row)

    table = tabulate(rows, headers=header, tablefmt="grid")
    print(table)


print_results_table(results, optimal_k)

+--------------+---------+---------+---------+---------+---------+---------+---------+---------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+-------------+
| Algorithm    | k = 2   | k = 3   | k = 4   | k = 5   | k = 6   | k = 7   | k = 8   | k = 9   | k = 10   | k = 11   | k = 12   | k = 13   | k = 14   | k = 15   | k = 16   | k = 17   | k = 18   | k = 19   | k = 20   | k = 21   | k = 22   | k = 23   | k = 24   | k = 25   | k = 26   | k = 27   | k = 28   | k = 29   | k = 30   | k = 31   | k = 32   | k = 33   | k = 34   | k = 35   | k = 36   | k = 37   | k = 38   | k = 39   | k = 40   | 

#### How did you approach finding the optimal k?

For each k in the range 2 to 50, we ran the K-means and Hierarchical clustering algorithms on the TF-IDF matrix.

For each k, we computed three evaluation metrics that describe the quality of the clustering:

1. Silhouette score: a higher value indicates that the clusters are well separated and that the points within a cluster are close to each other.
2. Calinski-Harabasz index: a higher value suggests that the clusters are dense and well separated.
3. Davies-Bouldin index: a lower value implies that the clusters are well separated and compact.

The code above selects the optimal k as the value that maximizes the Silhouette score. The Calinski-Harabasz and Davies-Bouldin indices are recorded for every k for comparison, but they are not used in the selection.
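
One possible way to let all three metrics influence the choice is rank aggregation: rank every k under each metric and pick the k with the lowest rank sum. This is a sketch of an alternative, not the selection rule used in the cells above. The values are the K-means results for k = 2 to 5 printed earlier, rounded to four decimals; the function name is hypothetical.

```python
def best_k_by_rank(results):
    """results maps k -> [silhouette, calinski_harabasz, davies_bouldin]."""
    ks = list(results)
    total_rank = {k: 0 for k in ks}
    # (metric index, whether a higher value is better)
    for idx, higher_is_better in [(0, True), (1, True), (2, False)]:
        ordered = sorted(ks, key=lambda k: results[k][idx], reverse=higher_is_better)
        for rank, k in enumerate(ordered, start=1):
            total_rank[k] += rank
    # Lowest rank sum wins; ties go to the smaller k
    best = min(ks, key=lambda k: (total_rank[k], k))
    return best, total_rank

kmeans_results = {
    2: [0.0277, 1.3159, 2.6637],
    3: [0.0112, 1.2902, 4.3997],
    4: [0.0222, 1.2815, 1.7448],
    5: [0.0245, 1.2834, 1.8027],
}
best, ranks = best_k_by_rank(kmeans_results)
print(best, ranks)
```

On this subset, k = 2 has the lowest rank sum even though its Davies-Bouldin index is not the best, which shows how the metrics can disagree.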

#### What algorithm do you believe is the best? Why?

Both K-means and Hierarchical clustering have the same optimal k value of 2. To determine which model is best, we can compare their silhouette scores at k = 2.

K-means Silhouette score for k = 2: 0.0277 \
Hierarchical Silhouette score for k = 2: 0.0210

The silhouette score ranges from -1 to 1, with higher values indicating better cluster separation and cohesion. K-means scores slightly higher than Hierarchical clustering (0.0277 compared to 0.0210), so the K-means model is preferable in this case. Both scores are close to 0, however, which indicates that the clusters overlap heavily and that the difference between the two algorithms is small.
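
To make the scale of the silhouette score concrete, here is a small pure-Python version for one-dimensional points. It is a teaching sketch with made-up data, not a replacement for `sklearn.metrics.silhouette_score`. For each point, `a` is the mean distance to its own cluster, `b` is the lowest mean distance to any other cluster, and the score is `(b - a) / max(a, b)`.

```python
def silhouette_1d(points, labels):
    """Mean silhouette score for 1-D points (toy implementation)."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        same = [abs(p - q) for j, (q, l) in enumerate(zip(points, labels))
                if l == lab and j != i]
        if not same:                      # singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        b = min(
            sum(abs(p - q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

points = [0, 1, 10, 11]
separated = silhouette_1d(points, [0, 0, 1, 1])   # clusters match the two groups
mixed = silhouette_1d(points, [0, 1, 0, 1])       # clusters cut across the groups
print(round(separated, 3), round(mixed, 3))
```

Well-separated clusters score close to 1 and badly assigned ones score below 0, so values near 0.02 sit very close to the "overlapping clusters" end of the scale.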

### Add Cluster ID to output file
In your data structure, add the cluster id for each smart city respectively. Show the code to append the clusterid below.

In [6]:
# First, use the optimal_k values to refit the K-means and Hierarchical models
kmeans_optimal = KMeans(n_clusters=optimal_k["K-means"]["k"], random_state=42)
kmeans_optimal_labels = kmeans_optimal.fit_predict(X)

# hierarchical_optimal = AgglomerativeClustering(n_clusters=optimal_k["Hierarchical"]["k"])
# hierarchical_optimal_labels = hierarchical_optimal.fit_predict(X_dense)

# Add the cluster labels to the DataFrame
df["K-means Cluster ID"] = kmeans_optimal_labels
# df["Hierarchical Cluster ID"] = hierarchical_optimal_labels


# Display the DataFrame with cluster IDs
print(df)


                      city                                           raw text   
0               VA Norfolk  City of Norfolk, VA\n*\nResponse Proposal to U...  \
1            KY Louisville  IMAGINE LOUISVILLE\n“BEYOND TRAFFIC: THE SMART...   
2   MN Minneapolis St Paul  Submitted by:\nCity of Minneapolis\nMr. Steven...   
3             CA Oceanside   \n  \n U.S. Department of Transportation  \nN...   
4                     DC_0   \n   \n    \n \nSmart  DC \nMaking  the Distr...   
..                     ...                                                ...   
59            OH Cleveland   \n \n \n United States  Department of Transpo...   
60              MI Detroit  U.S. DEPARTMENT OF TRANSPORTATION - BEYOND TR ...   
61           CA San Jose_0  1 \n Smart City Challenge:  San José on the \n...   
62               CA Fresno  U.S. Department of Transportation \nNotice of ...   
63            AK Anchorage    CONTENTS \n1 VISION ...........................   

                           

### Save Model

After finding the best model, it is desirable to have a way to persist the model for future use without having to retrain. Save the model using [model persistence](https://scikit-learn.org/stable/model_persistence.html). This model should be saved in the same directory as this notebook and should be loaded as the model for your `project3.py`.

Save the model as `model.pkl`. You do not have to use pickle, but be sure to persist the model using one of the methods listed in the link.
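
The save and load round trip can be sketched with the standard-library `pickle` module. A plain dictionary stands in for the fitted model here, and the helper names are hypothetical; a fitted scikit-learn estimator would be passed to the same functions. Only load pickle files from sources you trust, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

def save_model(model, path):
    """Serialize any picklable object to the given path."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)

def load_model(path):
    """Load an object previously written by save_model."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

# A dictionary stands in for the fitted clustering model in this example
stand_in = {"n_clusters": 2, "random_state": 42}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    save_model(stand_in, path)
    restored = load_model(path)
print(restored == stand_in)
```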

In [7]:

# Persist the K-means model fitted with the optimal k in cell 6 (kmeans_optimal),
# so that the saved model is the one that produced the cluster IDs
joblib.dump(kmeans_optimal, 'model.pkl')

# Load the K-means model from the 'model.pkl' file
loaded_kmeans = joblib.load('model.pkl')


## Deriving Themes and Concepts

Perform Topic Modeling on the cleaned data. Provide the top five words for `TOPIC_NUM = Best_k` as defined in the section above. Feel free to reference [Chapter 6](https://github.com/dipanjanS/text-analytics-with-python/tree/master/New-Second-Edition/Ch06%20-%20Text%20Summarization%20and%20Topic%20Models) for more information on Topic Modeling and Summarization.
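
Extracting the top words for a topic amounts to sorting the vocabulary by the topic's weights. The sketch below shows the idea with plain lists; the vocabulary and weights are invented for illustration, and in practice the weights would come from a fitted topic model.

```python
def top_words(topic_weights, feature_names, n_top_words=5):
    """Return the n_top_words terms with the largest weights for one topic."""
    order = sorted(range(len(topic_weights)),
                   key=lambda i: topic_weights[i], reverse=True)
    return [feature_names[i] for i in order[:n_top_words]]

# Made-up vocabulary and weights for a single topic
vocabulary = ["data", "transit", "river", "vehicle"]
weights = [0.9, 0.4, 0.1, 0.7]
print(top_words(weights, vocabulary, n_top_words=2))
```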

In [8]:
# Extract the optimal k value for K-means
optimal_k_value = optimal_k['K-means']['k']

# Fit the LDA model
lda = LatentDirichletAllocation(n_components=optimal_k_value, random_state=0)
lda.fit(X)

# Function to print top words for each topic
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = f"Topic #{topic_idx + 1}: "
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)

# Print the top 5 words for each topic
print_top_words(lda, vectorizer.get_feature_names_out(), 5)


Topic #1: pky river mill hill myonkers
Topic #2: data transportation system vehicle transit


### Extract themes
Write a theme for each topic (at least a sentence each).

would district atlanta resident city ng metro

application communication city would downtown operation platform

### Add Topic ID to output file
Add the top two topics for each smart city to the data structure.
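
The example output table at the top of the notebook stores topic ids as a string such as `T1, T2`. A minimal sketch of that formatting, assuming one topic distribution per city and 1-based topic numbering (the function name and the sample distribution are hypothetical):

```python
def top_two_topic_ids(distribution):
    """Format the two most probable topics as 'T<i>, T<j>' (1-based ids)."""
    order = sorted(range(len(distribution)),
                   key=lambda i: distribution[i], reverse=True)
    return ", ".join(f"T{i + 1}" for i in order[:2])

print(top_two_topic_ids([0.1, 0.7, 0.2]))
```

Note that when the model has only two topics, every city receives both of them and only the order carries information.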

In [9]:
# Obtain the topic distribution for each city
topic_distribution = lda.transform(X)

# Find the top two topics for each city
top_two_topics = np.argsort(topic_distribution, axis=1)[:, -2:]

# Combine the top two topics into a tuple
combined_topics = list(zip(top_two_topics[:, 1], top_two_topics[:, 0]))

# Add the combined top two topics to the DataFrame
df["Top Topics"] = combined_topics

# Print the DataFrame with combined top two topics
print(df)

                      city                                           raw text   
0               VA Norfolk  City of Norfolk, VA\n*\nResponse Proposal to U...  \
1            KY Louisville  IMAGINE LOUISVILLE\n“BEYOND TRAFFIC: THE SMART...   
2   MN Minneapolis St Paul  Submitted by:\nCity of Minneapolis\nMr. Steven...   
3             CA Oceanside   \n  \n U.S. Department of Transportation  \nN...   
4                     DC_0   \n   \n    \n \nSmart  DC \nMaking  the Distr...   
..                     ...                                                ...   
59            OH Cleveland   \n \n \n United States  Department of Transpo...   
60              MI Detroit  U.S. DEPARTMENT OF TRANSPORTATION - BEYOND TR ...   
61           CA San Jose_0  1 \n Smart City Challenge:  San José on the \n...   
62               CA Fresno  U.S. Department of Transportation \nNotice of ...   
63            AK Anchorage    CONTENTS \n1 VISION ...........................   

                           

## Gathering Applicant Summaries and Keywords

For each smart city applicant, gather a summary and keywords that are important to that document. You can use gensim to do this. Here are examples of functions that you could use.

```python

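# Note: the gensim.summarization module was removed in gensim 4.0,
# so these imports require a gensim 3.x release.
from gensim.summarization import summarize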

def summary(text, ratio=0.2, word_count=250, split=False):
    return summarize(text, ratio= ratio, word_count=word_count, split=split)
    
from gensim.summarization import keywords

def keys(text, ratio=0.01):
    return keywords(text, ratio=ratio)
```

### Add Summaries and Keywords
Add summary and keywords to output file.
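
If `gensim.summarization` is not available in your environment, a simple word-frequency approach can serve as a stand-in. This is a different and much cruder technique than gensim's TextRank-based functions: keywords are the most frequent non-stopword tokens, and the summary keeps the sentences whose words are most frequent overall. The function names and the sample text are made up for illustration.

```python
import re
from collections import Counter

def _tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def keywords_by_frequency(text, stop_words, top_n=5):
    """Most frequent non-stopword tokens."""
    counts = Counter(t for t in _tokens(text) if t not in stop_words)
    return [word for word, _ in counts.most_common(top_n)]

def summarize_by_frequency(text, stop_words, max_sentences=2):
    """Keep the highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(t for t in _tokens(text) if t not in stop_words)
    scores = [sum(freq[t] for t in _tokens(s)) for s in sentences]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(top[:max_sentences])
    return " ".join(sentences[i] for i in keep)

sample = ("Transit data improves transit planning. The weather was nice. "
          "Data sharing helps transit agencies.")
stop = {"the", "was"}
print(keywords_by_frequency(sample, stop, top_n=2))
print(summarize_by_frequency(sample, stop, max_sentences=2))
```

Applied to the raw text of each applicant, the two results could fill the `keywords` and `summary` columns of the output file.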

## Write output data

The output data should be written as a TSV file.
You can use `to_csv` method from Pandas for this if you are using a DataFrame.

`Syntax: df.to_csv('file.tsv', sep='\t')` \
`df.to_csv('smartcity_eda.tsv', sep='\t')`
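
Raw PDF text contains newlines and tabs, which can make a TSV file hard to read line by line. The sketch below uses the standard-library `csv` module to write the columns from the example table at the top of the notebook, collapsing whitespace inside each field first. The function name is hypothetical, and whitespace flattening is an assumption, not a requirement of the project.

```python
import csv
import io

COLUMNS = ["city", "raw text", "clean text", "clusterid", "topicids", "summary", "keywords"]

def rows_to_tsv(rows):
    """Serialize a list of row dictionaries as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Collapse newlines, tabs, and repeated spaces so each record stays on one line
        writer.writerow({col: " ".join(str(row[col]).split()) for col in COLUMNS})
    return buffer.getvalue()

example = [{
    "city": "Norman, OK",
    "raw text": "Test,\ntest , and testing.",
    "clean text": "test test test",
    "clusterid": 0,
    "topicids": "T1, T2",
    "summary": "test",
    "keywords": "test",
}]
tsv_text = rows_to_tsv(example)
print(tsv_text)
```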

In [10]:
# index=False keeps the pandas row index out of the output file
df.to_csv('smartcity_eda.tsv', sep='\t', index=False)

# Moving Forward
Now that you have explored the dataset, take the important features and functions to create your `project3.py`.
Please refer to the project spec for more guidance.
