## Student Name: 
## Student Email:

# Project 3: The Smart City Slicker

Imagine you are a stakeholder in a rising Smart City and want to know more about the themes and concepts of existing smart cities. You also want to know where your smart city places among the others. In this project, you will perform 
exploratory data analysis, often shortened to EDA, to examine data from the [2015 Smart City Challenge](https://www.transportation.gov/smartcity), find facts about the data, and communicate those facts through text analysis and visualizations.

In order to explore the data and visualize it, some modifications might need to be made to the data along the way. This is often referred to as data preprocessing or cleaning.
Though data preprocessing is technically different from EDA, EDA often exposes problems with the data that need to be fixed in order to continue exploring.
Because of this tight coupling, you have to clean the data as necessary to help understand the data.

In this project, you will apply your knowledge about data cleaning, machine learning, visualizations, and databases to explore smart city applications.

**Part 1** of the notebook will explore and clean the data. \
**Part 2** will take the results of the preprocessed data to create models and visualizations.

Empty cells are code cells. 
Cells denoted with [Your Answer Here] are markdown cells.
Edit and add as many cells as needed.

Output file for this notebook is shown as a table for display purposes. Note: The city name can be Norman, OK or OK Norman.

| city | raw text | clean text | clusterid | topicids | summary | keywords|
| -- | -- | -- | -- | -- | -- | -- |
|Norman, OK | Test, test , and testing. | test test test | 0 | T1, T2| test | test |

## Introduction
The Dataset: 2015 Smart City Challenge Applicants (non-finalist).
In this project you will use the applicants' PDFs as a dataset.
The dataset is from the U.S. Department of Transportation Smart City Challenge.

On the website page for the data, you can find some basic information about the challenge. This is an interesting dataset. Think of the questions that you might be able to answer! A few could be:

1. Can I identify frequently occurring words that could be removed during data preprocessing?
2. Where are the applicants from?
3. Are there multiple entries for the same city in different applications?
4. What are the major themes and concepts from the smart city applicants?

Let's load the data!

## Loading and Handling files

Load data from `smartcity/`. 

To extract the data from the PDF files, use the [pypdf.PdfReader](https://pypdf.readthedocs.io/en/stable/index.html) class.
It will allow you to extract pages from PDF files and add them to a data structure (dataframe, list, dictionary, etc.).
To install the module, use the command `pipenv install pypdf`.
You only need to handle PDF files, handling docx is not necessary.

In [None]:
import os
from pypdf import PdfReader  # installed with `pipenv install pypdf`

pdf_dir = 'smartcity/'
all_text = []

# Loop over all files in the PDF directory
for filename in os.listdir(pdf_dir):
    # Check if the file is a PDF file
    if filename.endswith('.pdf'):
        # Open the PDF file
        with open(os.path.join(pdf_dir, filename), 'rb') as pdf_file:
            # Create a PdfReader object
            pdf_reader = PdfReader(pdf_file)
            # Loop over each page in the PDF file
            for page_num in range(len(pdf_reader.pages)):
                # Extract the text from the page
                page_text = pdf_reader.pages[page_num].extract_text()

                # Add the text to the list of text from all files
                all_text.append(page_text)


# Print the text from all files
print(all_text)


Create a data structure to add the city name and raw text. You can choose to split the city name from the file.
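
The output description allows the city to appear as "Norman, OK" or "OK Norman". The sketch below shows one way to normalize a file name into the first form. It assumes file names look like `OK Norman.pdf` (a two-letter state abbreviation, then the city); `parse_city` is a hypothetical helper, and any other naming pattern in the dataset would need its own handling.

```python
import os
import re

def parse_city(filename):
    """Turn a file name such as 'OK Norman.pdf' into 'Norman, OK'.

    Assumes the stem starts with a two-letter state abbreviation followed by
    the city name; anything that does not match is returned unchanged.
    """
    stem = os.path.splitext(os.path.basename(filename))[0].strip()
    match = re.match(r"^([A-Z]{2})[\s_]+(.+)$", stem)
    if not match:
        return stem
    state, city = match.groups()
    return f"{city.strip()}, {state}"

print(parse_city("OK Norman.pdf"))
print(parse_city("smartcity/CA San Jose.pdf"))
```

Returning the stem unchanged for unexpected names keeps every file in the dataset, so odd names can be spotted and fixed by hand later.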

In [None]:
import os
from pypdf import PdfReader  # installed with `pipenv install pypdf`

pdf_dir = 'smartcity/'
data = []

# Loop over all files in the PDF directory
for filename in os.listdir(pdf_dir):
    # Check if the file is a PDF file
    if filename.endswith('.pdf'):
        # Extract the city name from the filename
        city_name = filename.split('.')[0]
        # Open the PDF file
        with open(os.path.join(pdf_dir, filename), 'rb') as pdf_file:
            # Create a PdfReader object
            pdf_reader = PdfReader(pdf_file)
            # Extract the raw text from the PDF file
            raw_text = ''
            for page in pdf_reader.pages:
                raw_text += page.extract_text()
            # Add the data to the list of dictionaries
            data.append({
                'city': city_name,
                'text': raw_text
            })

# Print the data
print(data)


## Cleaning Up PDFs

One of the more frustrating aspects of PDF is loading the data into a readable format. The first order of business will be to preprocess the data. To start, you can use code provided by Text Analytics with Python, [Chapter 3](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch03%20-%20Processing%20and%20Understanding%20Text/Ch03a%20-%20Text%20Wrangling.ipynb): [contractions.py](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch05%20-%20Text%20Classification/contractions.py) (Pages 136-137), and [text_normalizer.py](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch05%20-%20Text%20Classification/text_normalizer.py) (Pages 155-156). Feel free to download the scripts or add the code directly to the notebook (please note this code is performed on dataframes).

In addition to the data cleaning provided by the textbook, you will need to:
1. Consider removing terms that may affect clustering and topic modeling. Words to consider are cities, states, and common words (smart, city, page, etc.). Keep in mind that n-gram combinations are important; this can also be revisited later depending on your model's performance.
2. Check the data to remove applicants whose text was not processed correctly. Do not remove more than 15 cities from the data.
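
One pitfall when removing terms is that plain substring replacement also eats parts of longer words (removing "city" turns "capacity" into "capa"). The following is a minimal sketch of whole-word and whole-phrase removal with a regular expression; `strip_terms` and the term list are illustrative assumptions, not part of the textbook code.

```python
import re

def strip_terms(text, terms):
    """Remove whole-word (or whole-phrase) occurrences of each term."""
    # Longest terms first so "smart city" is tried before "smart" or "city".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    # Replace matches with a space, then collapse the leftover whitespace.
    return re.sub(r"\s+", " ", pattern.sub(" ", text)).strip()

sample = "Smart city page 3: the capacity of the smart grid"
print(strip_terms(sample, ["smart city", "smart", "city", "page"]))
```

Because the pattern uses `\b` word boundaries, "capacity" survives while the standalone terms are dropped. Bigrams such as "smart city" can be listed alongside single words.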


In [None]:
import os
import re
from pypdf import PdfReader
from contractions import CONTRACTION_MAP
from text_normalizer import normalize_corpus

def load_and_clean_pdfs(directory):
    data = []
    removed_cities = []

    # Iterate through all PDF files in the directory
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".pdf"):
            continue
        filepath = os.path.join(directory, filename)

        # Extract city name from file name
        city = filename[:-4]

        # Extract raw text from PDF file, one page at a time
        with open(filepath, "rb") as f:
            pdf_reader = PdfReader(f)
            raw_text = "\n".join(page.extract_text() or "" for page in pdf_reader.pages)

        # Clean raw text
        clean_text = normalize_text(raw_text)

        # Skip documents whose text is empty after cleaning
        if clean_text.strip() == "":
            removed_cities.append(city)
            continue

        # Remove terms that may affect clustering and topic modeling
        clean_text = remove_terms(clean_text)

        # Add city name, raw text, and clean text to data list
        data.append({"city": city, "raw text": raw_text, "clean text": clean_text})

    # The assignment allows at most 15 removed applicants
    if len(removed_cities) > 15:
        print(f"Warning: {len(removed_cities)} applicants removed; the limit is 15")
    print("Removed:", removed_cities)

    return data

def normalize_text(text):
    # Apply text normalization using contractions and text normalizer scripts
    # Code from Text Analytics with Python, Chapter 3: contractions.py (Pages 136-137)
    # and text_normalizer.py (Pages 155-156)
    text = expand_contractions(text, CONTRACTION_MAP)
    # normalize_corpus works on a list of documents and returns a list
    return normalize_corpus([text])[0]

def expand_contractions(text, contraction_map):
    # Function to expand contractions in the text
    # Code from Text Analytics with Python, Chapter 3: contractions.py (Pages 136-137)
    contraction_pattern = re.compile('({})'.format('|'.join(contraction_map.keys())),
                                     flags=re.IGNORECASE|re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_map.get(match)\
                                   if contraction_map.get(match)\
                                   else contraction_map.get(match.lower())
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction

    expanded_text = contraction_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

def remove_terms(text):
    # Function to remove terms that may affect clustering and topic modeling
    # You can customize this list, for example with city and state names
    terms_to_remove = ['city', 'state', 'smart', 'page']
    # \b keeps the match to whole words, so "capacity" is not turned into "capa"
    pattern = r'\b(?:' + '|'.join(re.escape(t) for t in terms_to_remove) + r')\b'
    text = re.sub(pattern, ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

# Example usage
data = load_and_clean_pdfs("smartcity")
print(data)


#### Add the cleaned text to the structure you created.


In [None]:
# load_and_clean_pdfs (defined above) already stores the cleaned text
# next to the city name and raw text, under the "clean text" key
data = load_and_clean_pdfs("smartcity")

# Confirm that the records carry all three fields
for record in data[:3]:
    print(record["city"], len(record["raw text"]), len(record["clean text"]))


### Clean Up: Discussion
Answer the questions below.

#### Which Smart City applicants did you remove? What issues did you see with the documents?

I removed the Smart City applicants that had empty or blank text after the cleaning process. These documents were identified based on the condition `clean_text.strip() == ""`, where `clean_text` represents the processed text of each document. The issues with these documents were that they either contained no meaningful content or the cleaning process resulted in the removal of all relevant information, making them irrelevant for further analysis.

#### Explain what additional text processing methods you used and why.

In addition to the basic text cleaning steps such as removing punctuation and converting to lowercase, I used the following additional text processing methods:

Lemmatization: I applied lemmatization to reduce words to their base or dictionary form. This helps in normalizing the text and reducing the dimensionality of the data. It improves the accuracy of topic modeling and keyword extraction by treating different inflected forms of a word as a single entity.

Stop-word Removal: I removed common stop words that do not carry significant meaning, such as "the," "is," "in," etc. This step helps to eliminate noise and reduce the dimensionality of the data, focusing on more meaningful and informative words.

Custom Term Removal: I removed additional terms that were specific to the Smart City domain and may not contribute much to the analysis. These terms, such as "city," "state," "smart," and "page," were identified based on domain knowledge and the specific requirements of the analysis.

These additional text processing methods were used to improve the quality of the text data, remove noise, and focus on meaningful words for topic modeling and keyword extraction.

#### Did you identify any potentially problematic words?

Based on the output, there were no explicitly mentioned potentially problematic words. However, it is important to review the data and output in detail to identify any words or terms that may cause issues or misinterpretations. It is always recommended to have domain knowledge and context understanding to identify potentially problematic words or terms in specific applications or analyses.
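
One way to review the data for such words is to count how many documents each token appears in: a token present in nearly every application carries little signal for clustering. The sketch below uses made-up example sentences and a hypothetical helper `document_frequency`; it illustrates the idea and is not output from the actual dataset.

```python
from collections import Counter

def document_frequency(docs):
    """Count in how many documents each token appears (not total occurrences)."""
    counts = Counter()
    for text in docs:
        counts.update(set(text.lower().split()))
    return counts

docs = [
    "transportation data platform for residents",
    "connected vehicle data for transit",
    "open data portal and transit signal priority",
]
doc_freq = document_frequency(docs)

# Tokens present in every document are candidates for removal.
everywhere = sorted(t for t, n in doc_freq.items() if n == len(docs))
print(everywhere)
```

On real data a softer threshold (for example, tokens found in more than 90% of documents) may be more useful than requiring every document.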

## Experimenting with Clustering Models

Now, you'll start to explore models to find the optimal clustering model. In this section, you'll explore [K-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), [Hierarchical](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html), and [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN) clustering algorithms.
Create these algorithms with k_clusters for K-means and Hierarchical.
For each cell in the table provide the [Silhouette score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score), [Calinski and Harabasz score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html#sklearn.metrics.calinski_harabasz_score), and [Davies-Bouldin score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html#sklearn.metrics.davies_bouldin_score).

In each cell, create an array to store the values.
For example, 

|Algorithm| k = 9 | k = 18| k = 36 | Optimal k| 
|--|--|--|--|--|
|K-means| [S,CH,DB]| [S,CH,DB] | [S,CH,DB] | [S,CH,DB] |
|Hierarchical |[S,CH,DB]| [S,CH,DB]| [S,CH,DB] | [S,CH,DB]|
|DBSCAN | X | X | X | [S,CH,DB] |



### Optimality 
You will need to find the optimal k for K-means and Hierarchical algorithms.
Find the optimality for k in the range 2 to 50.
Provide the code used to generate the optimal k and provide justification for your approach.


|Algorithm| k = 9 | k = 18| k = 36 | Optimal k| 
|--|--|--|--|--|
|K-means|--|--|--|--|
|Hierarchical |--|--|--|--|
|DBSCAN | X | X | X | -- |



In [None]:
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# The clustering algorithms need numeric features, so vectorize the cleaned text first.
# A dense array is used because some of these estimators and metrics do not accept sparse input.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([d["clean text"] for d in data]).toarray()

def make_model(algorithm, k):
    if algorithm == "kmeans":
        return KMeans(n_clusters=k, random_state=0, n_init=10)
    return AgglomerativeClustering(n_clusters=k)

def score_labels(X, labels):
    # Returns [S, CH, DB]; the scores are only defined for 2 <= clusters <= n_samples - 1
    n_clusters = len(set(labels))
    if n_clusters < 2 or n_clusters > len(labels) - 1:
        return None
    return [silhouette_score(X, labels),
            calinski_harabasz_score(X, labels),
            davies_bouldin_score(X, labels)]

def find_optimal_k(X, algorithm, min_k, max_k):
    # Pick the k with the highest Silhouette score
    optimal_k, max_silhouette_score = None, -1
    for k in range(min_k, min(max_k, len(X) - 1) + 1):
        labels = make_model(algorithm, k).fit_predict(X)
        if len(set(labels)) < 2:
            continue
        silhouette = silhouette_score(X, labels)
        if silhouette > max_silhouette_score:
            max_silhouette_score, optimal_k = silhouette, k
    return optimal_k

k_values = [9, 18, 36]

# Optimal k is searched separately for each algorithm
optimal_k_kmeans = find_optimal_k(X, "kmeans", 2, 50)
optimal_k_hierarchical = find_optimal_k(X, "hierarchical", 2, 50)

# Evaluate K-means and Hierarchical at the fixed k values and at their optimal k
results = {}
for algorithm, optimal_k in [("kmeans", optimal_k_kmeans), ("hierarchical", optimal_k_hierarchical)]:
    results[algorithm] = [score_labels(X, make_model(algorithm, k).fit_predict(X))
                          for k in k_values + [optimal_k]]

# DBSCAN chooses the number of clusters itself, so it has a single result.
# With the default eps and min_samples it may label everything as noise (None below).
dbscan_scores = score_labels(X, DBSCAN().fit_predict(X))

# Print the results
print("Algorithm\tk = 9\tk = 18\tk = 36\tOptimal k")
print("K-means", *results["kmeans"], sep="\t")
print("Hierarchical", *results["hierarchical"], sep="\t")
print("DBSCAN", "X", "X", "X", dbscan_scores, sep="\t")
print("Optimal k:", {"K-means": optimal_k_kmeans, "Hierarchical": optimal_k_hierarchical})


#### How did you approach finding the optimal k?

When finding the optimal k, I used the K-means algorithm and the Silhouette score as the evaluation metric. The Silhouette score measures how well each data point fits its assigned cluster, with higher values indicating better-defined clusters. I iterated through a range of k values and selected the k that resulted in the highest Silhouette score.
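
To make the Silhouette score concrete, here is a small pure-Python sketch of the computation for a handful of 2-D points. For each point, `a` is the mean distance to the other members of its own cluster and `b` is the mean distance to the nearest other cluster; the point's coefficient is `(b - a) / max(a, b)`. The function `silhouette` and the toy points are illustrative assumptions; in the notebook the scores come from scikit-learn's `silhouette_score`.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient for small in-memory data (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette(points, [0, 0, 1, 1])  # neighbours grouped together
bad = silhouette(points, [0, 1, 0, 1])   # neighbours split apart
print(round(good, 3), round(bad, 3))
```

The sensible grouping scores close to 1, while the labeling that splits neighbours apart scores below 0, which is why the highest score is used to choose k.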

#### What algorithm do you believe is the best? Why?

Regarding the best algorithm, it depends on the specific characteristics of the data and the problem at hand. In this case, since we are dealing with text data from smart city documents, it's difficult to determine the best algorithm without further information. However, K-means and Hierarchical clustering are commonly used for text data clustering. K-means is computationally efficient and suitable for a large number of samples, while Hierarchical clustering can capture hierarchical relationships between clusters. It's recommended to evaluate the performance of both algorithms using appropriate metrics and domain knowledge to make an informed decision.

### Add Cluster ID to output file
In your data structure, add the cluster id for each smart city. Show the code that appends the clusterid below.

In [None]:
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the cleaned text and fit K-means with the optimal k found above
features = TfidfVectorizer().fit_transform([d["clean text"] for d in data])
kmeans = KMeans(n_clusters=optimal_k_kmeans, random_state=0, n_init=10)
cluster_ids = kmeans.fit_predict(features)

# Add the cluster id to each smart city record;
# the records are written to the TSV file in the "Write output data" section
for city_data, cluster_id in zip(data, cluster_ids):
    city_data["clusterid"] = int(cluster_id)

# Show the result
for city_data in data:
    print(city_data["city"], city_data["clusterid"])


### Save Model

After finding the best model, it is desirable to have a way to persist the model for future use without having to retrain. Save the model using [model persistence](https://scikit-learn.org/stable/model_persistence.html). This model should be saved in the same directory as this notebook and should be loaded as the model for your `project3.py`.

Save the model as `model.pkl`. You do not have to use pickle, but be sure to persist the model using one of the methods listed in the link.
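
As a sketch of what persistence involves, the following round trip uses the standard-library `pickle` module with a plain dictionary standing in for a fitted model (the dictionary contents and the file name `model_demo.pkl` are made up for illustration). A scikit-learn estimator can be dumped and loaded the same way.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable object is handled the same way.
model = {"algorithm": "kmeans", "n_clusters": 3,
         "centroids": [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]]}

path = os.path.join(tempfile.gettempdir(), "model_demo.pkl")

# Save the object to disk
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back, as project3.py would do before predicting
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code. If new documents must be vectorized before prediction, the fitted vectorizer needs to be saved as well.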

In [None]:
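# Save the K-means model
import joblib
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the final model at notebook scope; models created inside the functions above are local to them
features = TfidfVectorizer().fit_transform([d["clean text"] for d in data])
kmeans = KMeans(n_clusters=optimal_k_kmeans, random_state=0, n_init=10).fit(features)
joblib.dump(kmeans, 'model.pkl')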


## Deriving Themes and Concepts

Perform Topic Modeling on the cleaned data. Provide the top five words for `TOPIC_NUM = Best_k` as defined in the section above. Feel free to reference [Chapter 6](https://github.com/dipanjanS/text-analytics-with-python/tree/master/New-Second-Edition/Ch06%20-%20Text%20Summarization%20and%20Topic%20Models) for more information on Topic Modeling and Summarization.

In [None]:
from gensim import corpora, models

# Create a list of tokenized documents (gensim expects each document as a list of tokens)
documents = [doc['clean text'].split() for doc in data]

# Create a dictionary of all the unique words in the documents
dictionary = corpora.Dictionary(documents)

# Convert the documents into a document-term matrix
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Define the number of topics (set this to the best k from the clustering section)
num_topics = 5

# Train the LDA model
lda_model = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10, random_state=0)

# Get the top five words for each topic
top_words_per_topic = []
for topic_id in range(num_topics):
    top_words = lda_model.show_topic(topic_id, topn=5)
    top_words_per_topic.append([word for word, _ in top_words])

# Print the top five words for each topic
for i, top_words in enumerate(top_words_per_topic):
    print(f"Topic {i + 1}: {' '.join(top_words)}")


### Extract themes
Write a theme for each topic (at least a sentence each).

[Your Answer]

[Your Answer]

[Your Answer]

### Add Topic ID to output file
Add the top two topics for each smart city to the data structure.
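
An LDA model typically returns a document's topics as `(topic_id, probability)` pairs. The sketch below picks the two most probable ones and formats them like the `topicids` column of the example table. The helper names, the sample probabilities, and the 1-based `T1, T2` numbering (topic 0 shown as `T1`) are assumptions made for illustration.

```python
def top_topic_ids(topic_probs, n=2):
    """Return the ids of the n most probable topics, most probable first."""
    ranked = sorted(topic_probs, key=lambda pair: pair[1], reverse=True)
    return [topic_id for topic_id, _ in ranked[:n]]

def format_topic_ids(topic_ids):
    """Render ids as 'T1, T2' (assumes 1-based labels in the output file)."""
    return ", ".join(f"T{tid + 1}" for tid in topic_ids)

# Made-up topic distribution for one document
doc_topics = [(0, 0.12), (3, 0.55), (7, 0.30), (9, 0.03)]
top_two = top_topic_ids(doc_topics)
print(top_two, format_topic_ids(top_two))
```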

In [None]:
from gensim import corpora, models

# Create a list of tokenized documents (gensim expects each document as a list of tokens)
documents = [doc['clean text'].split() for doc in data]

# Create a dictionary of all the unique words in the documents
dictionary = corpora.Dictionary(documents)

# Convert the documents into a document-term matrix
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Define the number of topics (set this to the best k from the clustering section)
num_topics = 5

# Train the LDA model
lda_model = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10, random_state=0)

# Get the top two topics for each smart city
for doc, doc_bow in zip(data, corpus):
    # Get the document's topic distribution
    doc_topics = lda_model[doc_bow]

    # Sort the topics by their probability in descending order
    doc_topics = sorted(doc_topics, key=lambda x: x[1], reverse=True)

    # Get the top two topics for the document
    top_topics = [topic_id for topic_id, _ in doc_topics[:2]]

    # Add the top topics to the data structure (column name as in the output table)
    doc['topicids'] = top_topics

# Print the updated data structure
for doc in data:
    print(f"City: {doc['city']}, Top Topics: {doc['topicids']}")


## Gathering Applicant Summaries and Keywords

For each smart city applicant, gather a summary and keywords that are important to that document. You can use gensim to do this; note that the `gensim.summarization` module is only available in gensim versions before 4.0. Here are examples of functions that you could use.

```python

from gensim.summarization import summarize

def summary(text, ratio=0.2, word_count=250, split=False):
    return summarize(text, ratio= ratio, word_count=word_count, split=split)
    
from gensim.summarization import keywords

def keys(text, ratio=0.01):
    return keywords(text, ratio=ratio)
```
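
If `gensim.summarization` is not available in your environment, a simple frequency-based approach can stand in for it. The sketch below is not the gensim algorithm (which is based on TextRank): it scores each sentence by the average frequency of its words and keeps the best ones in their original order. The function names, the tiny stop-word list, and the sample text are assumptions for illustration.

```python
import re
from collections import Counter

STOP = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "on",
        "with", "will", "that", "this"}

def tokenize(text):
    """Lowercase alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]

def simple_keywords(text, top_n=5):
    """Most frequent non-stop-word tokens."""
    return [w for w, _ in Counter(tokenize(text)).most_common(top_n)]

def simple_summary(text, n_sentences=2):
    """Keep the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:n_sentences]))

text = ("Transit signals will share data with buses. "
        "The weather was pleasant. "
        "Buses use transit data to keep schedules.")
print(simple_summary(text))
print(simple_keywords(text, top_n=3))
```

Summaries of this kind need sentence punctuation, so they work on the raw text rather than on cleaned text that has had punctuation stripped.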

In [6]:
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Load the spaCy model once instead of on every call
nlp = spacy.load("en_core_web_sm")

def generate_summary_keywords_spacy(document):
    doc = nlp(document)
    # Summary: the first two sentences of the document
    sentences = [sent.text.strip() for sent in doc.sents]
    summary = " ".join(sentences[:2])
    # Keywords: every token that is not a stop word, punctuation, or whitespace
    keywords = [token.text.lower() for token in doc
                if not token.is_stop and not token.is_punct and not token.is_space]
    return summary, keywords

def generate_summary_keywords_nltk(document):
    # Needs the NLTK "punkt" and "stopwords" resources (see nltk.download)
    stop_words = set(stopwords.words("english"))
    sentences = nltk.sent_tokenize(document)
    summary = " ".join(sentences[:2])
    keywords = [word.lower() for word in word_tokenize(document)
                if word.lower() not in stop_words and word.isalnum()]
    return summary, keywords

# Try the spaCy version on the first document loaded earlier
summary, keywords = generate_summary_keywords_spacy(data[0]['clean text'])
print(summary[:500])
print(keywords[:20])




### Add Summaries and Keywords
Add summary and keywords to output file.

In [None]:
# Generate summaries and keywords for each document.
# `data` from the earlier cells is reused: reloading the PDFs here would
# discard the cluster and topic ids that were added to each record.
for doc in data:
    document = doc['clean text']
    summary, keywords = generate_summary_keywords_spacy(document)

    # Add summary and keywords to the data structure
    doc['summary'] = summary
    doc['keywords'] = ', '.join(keywords)

    # Append summary and keywords to the output file
    with open("output.txt", "a", encoding="utf-8") as f:
        f.write(f"City: {doc['city']}\n")
        f.write(f"Summary: {doc['summary']}\n")
        f.write(f"Keywords: {doc['keywords']}\n\n")


## Write output data

The output data should be written as a TSV file.
You can use `to_csv` method from Pandas for this if you are using a DataFrame.

`Syntax: df.to_csv('file.tsv', sep = '')` \
`df.to_csv('smartcity_eda.tsv', sep='\t')`
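
Raw PDF text usually contains tabs and newlines, which are also the delimiters of a TSV file. The sketch below uses the standard-library `csv` module with a made-up row to show that default quoting keeps such a field intact on a round trip; pandas `to_csv` applies minimal quoting by default as well, so the same should hold there.

```python
import csv
import io

rows = [
    {"city": "Norman, OK",
     "raw text": "Test,\ttest ,\nand testing.",
     "clean text": "test test test"},
]

# Write to an in-memory buffer instead of a file to keep the example self-contained
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["city", "raw text", "clean text"], delimiter="\t")
writer.writeheader()
writer.writerows(rows)

# Reading it back recovers the original field, tab and newline included
buffer.seek(0)
restored = list(csv.DictReader(buffer, delimiter="\t"))
print(restored[0]["raw text"] == rows[0]["raw text"])
```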

In [None]:
import pandas as pd

# Convert the data to a DataFrame
df = pd.DataFrame(data)

# Specify the output file path
output_file = 'smartcity_eda.tsv'

# Write the DataFrame to a TSV file
df.to_csv(output_file, sep='\t', index=False)


# Moving Forward
Now that you have explored the dataset, take the important features and functions to create your `project3.py`.
Please refer to the project spec for more guidance.
