
# What is Faker?

**Faker** is a Python library that helps you generate **realistic fake data**.

Why use it?
- For **testing code** when you don’t have real data
- To practice **data analysis or machine learning**
- To simulate **real-world scenarios** like customer data, logs, or sensor data

Today, we’ll use it to simulate **urban traffic incident reports**.


### Install and Import Faker

Let’s install the Faker library, then import it into our notebook along with pandas.


In [None]:
# Install the Faker library to generate fake data
!pip install faker

# Import Faker for generating data
from faker import Faker
# Import pandas for handling tabular data
import pandas as pd



# Step 1: Initialize Faker

We now create a Faker generator object to start producing fake data.


In [None]:
# Create a Faker instance
# This object will allow us to generate fake cities, days, and sentences
fake = Faker()

# Step 2: Generate Fake Traffic Data

We'll create:
- A random city or location
- A day of the week
- A vehicle count (a random integer between 400 and 1200, inclusive)
- A random sentence standing in for a traffic officer's incident note

We’ll use this data for our case study.


In [None]:
# Create a dictionary with fake traffic data
data = {
    # Generate 10 fake city names (could represent intersections)
    "location": [fake.city() for _ in range(10)],
    # Generate 10 random days of the week
    "day_of_week": [fake.day_of_week() for _ in range(10)],
    # Generate 10 vehicle counts between 400 and 1200
    "vehicle_count": [fake.random_int(400, 1200) for _ in range(10)],
    # Generate 10 fake incident sentences with about 12 words each
    "incident_note": [fake.sentence(nb_words=12) for _ in range(10)]
}

# Convert this dictionary into a Pandas DataFrame
df = pd.DataFrame(data)

# Show the first few rows of the dataset
df.head()

| | location | day_of_week | vehicle_count | incident_note |
|---|---|---|---|---|
| 0 | South Brandonside | Saturday | 605 | Window wear service degree town pay treat clai... |
| 1 | Gravesborough | Thursday | 783 | Receive answer down two method person more sim... |
| 2 | Matthewchester | Tuesday | 815 | Boy per money close edge once society. |
| 3 | Rachelton | Wednesday | 1073 | Sure month camera its beat star lead heart res... |
| 4 | Stevenburgh | Wednesday | 638 | Because nothing particularly interview sit new... |


# Step 3: Export Incident Reports for NLP

We'll save the `incident_note` column to a text file.

This will let us later:
- Clean the text
- Tokenize it
- Extract important keywords (as we did in Assignment 9)


In [None]:
# Open a new text file to write incident notes
with open("my_incident_notes.txt", "w") as f:
    # Write each note from the DataFrame to a new line in the file
    for note in df["incident_note"]:
        f.write(note + "\n")
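
A quick way to confirm the export worked is to read the file back and check that it holds one line per note. The sketch below is self-contained: the two notes are hypothetical stand-ins for `df["incident_note"]`, and it writes to a temporary folder rather than to the notebook's working directory.

```python
import os
import tempfile

# Hypothetical notes standing in for df["incident_note"]
notes = ["Minor collision near the bridge.", "Signal outage reported at noon."]

# Write to a temporary folder so the sketch does not touch real files
path = os.path.join(tempfile.mkdtemp(), "my_incident_notes.txt")
with open(path, "w", encoding="utf-8") as f:
    for note in notes:
        f.write(note + "\n")

# Read the file back and split it into lines (one per note)
with open(path, "r", encoding="utf-8") as f:
    lines = f.read().splitlines()

# True when every note made it into the file
print(len(lines) == len(notes))
```

Passing `encoding="utf-8"` explicitly is a design choice here: it keeps the file readable on systems whose default encoding differs.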

# Step 4: Save Dataset for Pandas and Plotting

Now we’ll export the full table to a `.csv` file. We'll use this later to:
- Group by day
- Analyze traffic
- Plot results using Matplotlib


In [None]:
# Export the full dataset to a CSV file for use in other tools or tasks
df.to_csv("my_traffic_data.csv", index=False)
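
As a preview of the "group by day" step, here is a minimal sketch of averaging vehicle counts per day with pandas. The three rows are hypothetical values that reuse the column names from the exported CSV, and `df_demo` and `avg_by_day` are illustrative names.

```python
import pandas as pd

# Hypothetical rows with the same column names as the exported CSV
df_demo = pd.DataFrame({
    "day_of_week": ["Monday", "Monday", "Tuesday"],
    "vehicle_count": [500, 700, 900],
})

# Average vehicle count for each day of the week
avg_by_day = df_demo.groupby("day_of_week")["vehicle_count"].mean()
print(avg_by_day)
```

With the real file, `pd.read_csv("my_traffic_data.csv")` would replace the hand-built DataFrame, and `avg_by_day.plot(kind="bar")` is one way to chart the result with Matplotlib.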

# Text Cleaning with NLTK

Let’s quickly see how we can process the `my_incident_notes.txt` file:

- Convert to lowercase
- Remove punctuation
- Remove stopwords
- Show top 5 frequent words

We'll use this idea in our main question task.


In [None]:
# Import the Natural Language Toolkit (NLTK) library
import nltk
# Download necessary language resources from NLTK
nltk.download("punkt")       # For tokenizing words
nltk.download("stopwords")   # For removing common words like "the", "is"
nltk.download("punkt_tab")   # Tokenizer tables needed by word_tokenize in recent NLTK releases

# Import specific modules for tokenization and stopword handling
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
from collections import Counter

# Read the incident notes text file
with open("my_incident_notes.txt", "r") as f:
    text = f.read()

# Convert the text to lowercase and tokenize into words
tokens = word_tokenize(text.lower())

# Build the stopword set once so each lookup is fast
stop_words = set(stopwords.words("english"))

# Remove punctuation and stopwords to get meaningful keywords
clean_tokens = [
    t for t in tokens
    if t not in string.punctuation and t not in stop_words
]

# Count and display the 5 most common words
Counter(clean_tokens).most_common(5)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


[('financial', 2), ('feel', 2), ('south', 2), ('window', 1), ('wear', 1)]
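
The same four steps (lowercase, remove punctuation, remove stopwords, count) can also be sketched with only the standard library, which is handy when NLTK's downloads are unavailable. This is a simplified stand-in, not a replacement for the NLTK version: the `top_words` helper and the tiny `STOPWORDS` set are assumptions made for illustration, and a regular expression takes the place of `word_tokenize`.

```python
import re
from collections import Counter

# Small illustrative stopword set (an assumption; NLTK's English list is much longer)
STOPWORDS = {"the", "a", "an", "is", "at", "on", "of", "and", "to", "in"}

def top_words(text, n=5):
    """Lowercase the text, keep alphabetic words, drop stopwords, count the rest."""
    # The regex keeps runs of letters, which also discards punctuation
    words = re.findall(r"[a-z]+", text.lower())
    keywords = [w for w in words if w not in STOPWORDS]
    return Counter(keywords).most_common(n)

# Hypothetical incident text used only for this demonstration
sample = "The signal is out. Signal repair crew is on the way."
print(top_words(sample, 2))
```

Because Faker's sentences are built from random words, counts in the real output stay low; text with repeated terms, as in the sample above, gives a clearer ranking.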