# Introduction to Python


Python is a high-level, interpreted programming language known for its simplicity and readability. It is widely used in various domains, including web development, data analysis, text analysis, artificial intelligence, scientific computing, and more. Python's versatility and extensive library ecosystem make it a popular choice for both beginners and experienced developers.


## Example: Hello World in Python
```python
print("Hello, World!")
```

## Applications of Python
| **Category** | **Libraries/Tools** | **Example Use Case**|
| -------------|---------------------|---------------------|
| Web Development | Django, Flask | Build a blog, dashboard, or interactive visualization|
| Data Science | pandas, NumPy, Matplotlib | Analyze, clean, and visualize quantitative or qualitative data|
| Automation | os, shutil, sched | Rename a batch of files |
| Text Analysis | spaCy, NLTK | Sentiment analysis on reviews |
| Mapping | GeoPandas, Folium, geopy | Plot locations on an interactive map |
| Data Management | pandas, `csv`, `xml`, `json` | Clean and standardize metadata for a digital collection |
| Webscraping | BeautifulSoup, Scrapy | Collect daily weather data |
| Machine Learning | scikit-learn, TensorFlow | Classify emails as spam|
| Data Processing/Cleaning | pandas, regex, `csv` | Clean historical census data for analysis |
| Education |  Jupyter | Provide introductory material in an easy format |


## Libraries 

Libraries are collections of code that you can essentially "check out" to use. Once you import a library, you call modules or functions within it to perform specific tasks. 

<img src="img/python-library.jpg" alt="Python Library Example" width="600"/>
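
As a minimal sketch of the import-then-call pattern, the example below uses `math` and `statistics`, two modules from Python's standard library chosen only for illustration (the rest of this tutorial relies on pandas, NumPy, and matplotlib). The page counts are made-up values.

```python
import math                 # import a whole library
import statistics as stats  # import a library under a shorter alias

# Hypothetical page counts, used only for illustration
page_counts = [216, 368, 281, 432]

# Call functions from each library with dot syntax: library.function(...)
square_root = math.sqrt(16)
average_pages = stats.mean(page_counts)

print(square_root)
print(average_pages)
```

The alias after `as` is just a nickname: `stats.mean` and `statistics.mean` would refer to the same function.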

In [None]:
## Here we're going to import a couple of libraries that we will use in our code. 
## Notice that we are using the "as" keyword to give the libraries aliases - this is a common practice that makes the code cleaner and easier to read.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Data Types

Data types are the kinds of values you can use in Python. Every value in Python has a type, and unlike many other languages, you do not have to declare these types explicitly. Because of that, you have to be diligent about keeping track of types as you go. 

Each type has its own set of functions and expressions that can be performed on it. 

### Basic Data Types
| Type | Example | Description |
|------|---------|-------------|
| `int`| 42, -2, 3244235 | Whole numbers (positive or negative) |
| `float` | 3.193727, 0.55, -2423.34243 | Decimal numbers |
| `str` | "hello", "hello, world", "h" | Text (string of characters) | 
| `bool` | True, False | Boolean values (logic, conditionals) | 
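
The sketch below shows one way to check types as you go, using the built-in `type()` function. All variable names and values here are hypothetical and chosen only for illustration.

```python
# Hypothetical values, one for each basic type
copies = 3                    # int
price = 12.50                 # float
genre = "science fiction"     # str
is_checked_out = False        # bool

for value in (copies, price, genre, is_checked_out):
    print(value, type(value))

# The same name can later point to a value of a different type,
# which is why it helps to check with type() as you go
copies = "three"
print(type(copies))

# Mixing types can fail: "3" + 1 raises a TypeError,
# so convert explicitly with int(), float(), or str()
total = int("3") + 1
print(total)
```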

In [None]:
## Now let's go through some examples of these data types 

## number of pages - ints


In [None]:
## number of pages - floats 


We can also do other types of mathematics: 

| Operation | Symbol | Example | Output |
|------|---------|-------------|--------|
| Addition | `+` | `1 + 1`| `2`|
| Subtraction |  `-` | `2 - 1` | `1`|
| Multiplication | `*` | `2 * 2` | `4`|
| Division | `/` | `3 / 1` | `3.0`|
| Division (floor) | `//` | `10 // 3`| `3`|
| Remainder | `%` | `10 % 3` | `1`|
| Exponent | `**` | ` 3 ** 2` | `9`| 
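
A short sketch of the operators in the table, using made-up numbers. The detail worth noticing is that `/` always returns a float, while `//` and `%` on two ints return ints.

```python
# Hypothetical numbers: 10 books shared among 3 readers
books = 10
readers = 3

print(books + readers)    # addition
print(books - readers)    # subtraction
print(books * readers)    # multiplication
print(books / readers)    # true division always returns a float
print(books // readers)   # floor division drops the fractional part
print(books % readers)    # remainder left over after floor division
print(books ** 2)         # exponent

# Even when the result is a whole number, / gives a float
print(6 / 3, type(6 / 3))
```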


**Try It Yourself**: Let's combine what we know about variable names, expressions, and ints/floats to write a longer sequence of commands that works out the cost of a book. Say you're buying a book from an independent bookstore that charges a sales tax of 5% and a shipping fee of 12%. If the item costs $27, first calculate the sales tax, then the shipping fee, and finally the total cost. 

In [None]:
item_cost = ... 
sales_tax = ...
shipping_fee = ...
total_cost = ...
print(total_cost, sales_tax, shipping_fee)

In [None]:
## We can also convert numbers to strings
num_pages_str = ...
print(num_pages_str, type(num_pages_str))
print(num_pages, type(num_pages))

In [None]:
## Now let's look more at strings 



In [None]:
## we can do a couple different things with strings, like finding their length
title_length = ...

## or we can add strings together
author_name = "Douglas Adams" 
full_title = ...
print(full_title)

## we can also strip characters from strings, select specific words or characters, and convert them to uppercase or lowercase
stripped_title = ...
print(stripped_title)

last_word = ...
print(last_word)

uppercase_title = ...
print(uppercase_title)

lowercase_title = ...
print(lowercase_title)




### Collection Data Types
| Type | Example | Description |
|------|---------|-------------|
| `list`| `[1, 2, 3]` or `[99, 'hello', False]` | Ordered, changeable sequence |
| `dict` | `{"key":"value"}` or `{"Ada":['Librarian', 42, "Carpenter"]}` | Key-value pairs |
| `tuple` | `(1, 2, 3, 3)` | Ordered, unchangeable sequence | 
| `set` | `{1, 2, 3}` or `set([1, 2, 3])` | Unordered collection of unique values | 
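
To make the differences between the four collection types concrete, here is a minimal sketch. The titles, page counts, and coordinates are hypothetical values used only for illustration.

```python
# list: ordered and changeable
shelf = ["1984", "Emma", "Dracula"]
shelf.append("Beloved")

# dict: look up a value by its key (page counts are made up)
page_counts = {"1984": 368, "Emma": 474}
emma_pages = page_counts["Emma"]

# tuple: ordered but unchangeable (assigning to an item raises a TypeError)
coordinates = (39.9, -75.35)

# set: duplicates are dropped automatically
genres = {"satire", "romance", "satire"}

print(shelf)
print(emma_pages)
print(coordinates[0])
print(len(genres))
```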

In [None]:
## Now let's look at lists, which are a way to store multiple items in a single variable
book_titles = ["The Hitchhiker's Guide to the Galaxy", 
               "1984", 
               "To Kill a Mockingbird", 
               "Pride and Prejudice"]
print(book_titles)
print(type(book_titles))


In [None]:
## we can select specific items from a list using their index (Python always starts counting at 0)
first_book = ...
print(first_book)

## an easy way to get the last item is to use a negative index
last_book = ...
print(last_book)

## we can also add items to a list using append()
...
print(book_titles)

## or we can remove items using remove()
...
print(book_titles)

## we can also sort lists
...
print(book_titles)

In [None]:
## now let's look at dictionaries, which are a way to store key-value pairs

book_info = {
    "title": "The Hitchhiker's Guide to the Galaxy",
    "author": "Douglas Adams",
    "pages": 216,
    "published_year": 1979
}
...
...



In [None]:
## we can access specific values in a dictionary using their keys
title = ...
print(title)

## we can also add new key-value pairs to a dictionary
...
print(book_info)

## or we can remove key-value pairs using the del keyword
...
print(book_info)

In [None]:
## we can have dictionaries within dictionaries, which is useful for more complex data structures
book_collection = {
    24234: {
        "title": "The Hitchhiker's Guide to the Galaxy",
        "author": "Douglas Adams",
        "pages": 216
    },
    24235: {
        "title": "1984",
        "author": "George Orwell",
        "pages": 368
    }
}
print(book_collection)


In [None]:
## we can access specific books in the collection using their keys
book_id = 24235
book_details = ...
print(book_details)

## we can also access specific values within the nested dictionary
book_title = ...
print(book_title)

## Loops and Conditionals 

### Conditionals or Comparisons
Sometimes we want to compare variables or expressions. This could be to find matching amounts or text, or to identify when a specific number rises above a certain threshold. We can do this using comparison statements. 

Reminder: Booleans represent either `True` or `False` and are often used in logic or conditional statements.

In [None]:
## let's look at some examples of using conditional statements 

...


### Methods of comparison 

| Comparison | Operator | 
|------|---------|
| Less than | `<` |
| Greater than | `>` | 
| Less than or equal to | `<=` | 
| Greater than or equal to | `>=` |
| Equal | `==` |
| Not equal | `!=` |
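
The sketch below applies a few of these operators and then uses them inside an `if` / `elif` / `else` chain. The variable names and page counts are hypothetical.

```python
# Hypothetical page counts for two books
book_a_pages = 216
book_b_pages = 368

# Each comparison evaluates to a Boolean
print(book_a_pages < book_b_pages)
print(book_a_pages == book_b_pages)
print(book_a_pages != book_b_pages)

# if / elif / else runs only the first branch whose condition is True
if book_a_pages > book_b_pages:
    longer = "book A"
elif book_a_pages < book_b_pages:
    longer = "book B"
else:
    longer = "neither"

print(f"The longer book is {longer}.")
```

Note that `==` compares two values, while a single `=` assigns a value to a name.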

In [None]:
## We can combine if statements with conditional operators to help move through the code 
...



In [None]:
## we can also combine multiple conditions together using keywords like 'and', 'or', and 'not' 
if book1_pages > 200 and book2_pages > 300:
    print("Both books have a significant number of pages.")
else:
    print("At least one book does not have a significant number of pages.")

### Loops (Automation)

Sometimes we want to repeat the exact same task again and again. In Python, loops let us automate this kind of repetition efficiently. The two most common types of loops are *for* and *while*. A for loop is often used to iterate over a sequence such as a list, executing the same code for each element. A while loop continues to run as long as a specified condition remains true. 
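
As a minimal sketch of the two loop types, the example below uses a made-up list of titles and a made-up reading schedule.

```python
# Hypothetical list of titles
titles = ["Emma", "Dracula", "Beloved"]

# for loop: run the same code once for each item in the list
title_lengths = []
for title in titles:
    title_lengths.append(len(title))
print(title_lengths)

# while loop: keep going as long as the condition stays True
pages_left = 100     # hypothetical number of pages left to read
days = 0
while pages_left > 0:
    pages_left -= 30     # read 30 pages per day
    days += 1
print(days)
```

A while loop needs something inside it that eventually makes the condition false (here, `pages_left` shrinking), otherwise it never stops.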

In [None]:
## The final piece is to look at loops which allows us to iterate over our collection data types

## Let's start with a simple for loop to iterate over a list of book titles
...

In [None]:
## we can also use the range function to iterate over a list by the index values 
for i in range(len(book_titles)):
    print(f"Book {i + 1}: {book_titles[i]}")


In [None]:
## now let's combine loops and conditional statements
for ... in ...:
    if ...:
        print(f"Book ID {book_id} has more than 300 pages: {details['title']}")

## Tables

When working with data in Python, we often organize it in a table with rows and columns. The most powerful tool for working with tables is a library called `pandas`, which makes it easy to load data from CSV and Excel files, view and explore datasets, filter and clean data, calculate statistics, and prepare data for visualization. 

Tables are stored in an object called a DataFrame. 
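
Before loading a real file, it can help to see a DataFrame built by hand. This is only a sketch: the `books` table and its page counts are hypothetical, and the column names are chosen for illustration rather than taken from the dataset used in this tutorial.

```python
import pandas as pd

# Hypothetical data: each key becomes a column, each list holds that column's values
books = pd.DataFrame({
    "title": ["Emma", "Dracula", "Beloved"],
    "pub_year": [1815, 1897, 1987],
    "pages": [474, 418, 324],   # made-up page counts
})

print(books.shape)              # (number of rows, number of columns)
print(books["pages"].mean())    # average of one column

# Filter rows with a Boolean condition on a column
after_1850 = books[books["pub_year"] > 1850]
print(after_1850["title"].tolist())
```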

In [None]:
## We're going to start by using Pandas to read in a CSV file that contains metadata about novels 
try:
  df = pd.read_csv('top-500-novels-metadata_2025-01-11.csv')  ## https://www.responsible-datasets-in-context.com/posts/top-500-novels/top-500-novels.html?tab=data-essay#whats-in-the-data 
except FileNotFoundError:  ## the file is not in this folder yet, so download it first
  !wget https://raw.githubusercontent.com/tri-cods/python/refs/heads/main/top-500-novels-metadata_2025-01-11.csv
  df = pd.read_csv('top-500-novels-metadata_2025-01-11.csv')

df

In [None]:
## You can double click on the table output above to hide or collapse it. 
## Notice how the table hides the middle rows and columns to make it easier to read?
## Let's change that so we can see all the columns in the DataFrame.
with pd.option_context('display.max_columns', None):
    display(df)

In [None]:
## we can do some basic analysis on the dataset 
print(df.shape)  # Get the number of rows and columns in the DataFrame
df.info()  # Print information about the DataFrame, including data types and non-null counts (info() prints directly and returns None, so it does not need print())


In [None]:
print(df.describe())  # Get summary statistics for numerical columns

In [None]:
df.columns

In [None]:
## We can select specific columns from the DataFrame by using their names
...

## We can also select multiple columns by passing a list of column names
df[['title', 'author', 'pub_year']]



In [None]:
## We can filter the DataFrame based on specific conditions
filtered_df = df[...]  # Select rows where the year is greater than 2000
print(filtered_df)

In [None]:
## We can calculate statistics on specific columns
avg_holdings = df['oclc_holdings'].mean()  # Calculate the average number of OCLC holdings
print(avg_holdings)

**Try It Yourself**: Select a column and try to find out some basic summary statistics such as the minimum, maximum, median, mean, and mode values. 

## Visualization

Now let's think about how we might continue to explore this dataset using visualizations. We're going to be using a combination of pandas' built-in plotting functions and the matplotlib library. 

Matplotlib is one of the most powerful and widely used libraries for creating visualizations in Python. While other libraries such as seaborn and plotly offer stylish and interactive plots with less code, matplotlib gives you the ability to have fine-grained control over every aspect of a figure. 


In [None]:
## Example 1: Histograms - Used for showing the distribution of single numerical variable 

df.hist('gr_avg_rating', bins=20, edgecolor='black')  ## here is where we actually create the histogram
plt.xlabel('Average Rating')  ## here we label the x-axis
plt.ylabel('Frequency')  ## here we label the y-axis
plt.title('Distribution of Average Goodreads Ratings')  ## here we give it a title 

Now let's look at a more complicated example. We have several columns that are categorical. We might want to look at the comparison between these columns such as the example below where we examine the difference in Goodreads averages for books in the public domain versus books not in the public domain. 

In [None]:
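## keep the two groups mutually exclusive so that a row marked 'unavailable' does not land in both
no_text_values = ['NA_not-pub-domain', 'unavailable']
notPublic = df[df['pg_eng_url'].isin(no_text_values)].copy()  ## .copy() avoids a SettingWithCopyWarning when we add a column below
Public = df[~df['pg_eng_url'].isin(no_text_values)].copy()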

notPublic['PublicDomain'] = 'Not Public Domain'
Public['PublicDomain'] = 'Public Domain'
updated_table_domain = pd.concat([notPublic, Public])

plt.hist(notPublic['gr_avg_rating'], bins=20, edgecolor='black', alpha=0.5, label='Not Public Domain', color = 'red')
plt.hist(Public['gr_avg_rating'], bins=20, edgecolor='black', alpha=0.5, label='Public Domain', color = 'blue')
plt.legend()
plt.xlabel('Goodreads Average Rating')
plt.ylabel('Frequency')
plt.title('Distribution of Goodreads Average Ratings for Public Domain vs Not Public Domain Novels')
plt.show()
plt.clf()

In [None]:
## we can also do this using seaborn, which is a library built on top of matplotlib that makes it easier to create complex visualizations
import seaborn as sns
sns.histplot(data=df, x='gr_avg_rating', hue = 'pg_eng_url', bins=20, kde=True, color='blue', edgecolor='black')
plt.xlabel('Average Goodreads Rating')
plt.ylabel('Frequency')
plt.title('Distribution of Average Goodreads Ratings')
plt.show()

In [None]:
sns.histplot(data=updated_table_domain, x='gr_avg_rating', hue = 'PublicDomain', bins=20, kde=True, color='blue', edgecolor='black')
plt.xlabel('Average Goodreads Rating')
plt.ylabel('Frequency')
plt.title('Distribution of Average Goodreads Ratings')
plt.show()

In [None]:
## Example 2: Scatter Plots - Used for showing the relationship between two numerical variables 

plt.scatter(df['pub_year'], df['gr_avg_rating'], alpha=0.5, color='blue')
plt.xlabel('Publication Year')
plt.ylabel('Average Goodreads Rating')
plt.title('Publication Year vs Average Goodreads Rating')
plt.show()  

In [None]:
## Example 3: Line Plots - Used for showing trends over time or ordered categories

plt.plot(df['pub_year'], df['gr_avg_rating'], marker='o', linestyle='-', color='blue', alpha=0.5)
plt.xlabel('Publication Year')  
plt.ylabel('Average Goodreads Rating')
plt.title('Publication Year vs Average Goodreads Rating')
plt.show()

In [None]:
## we can see that this doesn't quite work as expected because the publication years are not sorted and are not unique
## so let's try to fix that
sorted_df = df.sort_values(by='pub_year')  # Sort the DataFrame by publication year
plt.plot(sorted_df['pub_year'], sorted_df['gr_avg_rating'], marker='o', linestyle='-', color='blue', alpha=0.5)
plt.show()  ## display this attempt before moving on to the next one

## still not quite what we want so let's group the data by publication year and get an average
grouped_df = df.groupby('pub_year')['gr_avg_rating'].mean().reset_index()  # Group by publication year and calculate the average rating
plt.plot(grouped_df['pub_year'], grouped_df['gr_avg_rating'], linestyle='-', color='blue', alpha=0.5)
plt.xlabel('Publication Year')
plt.ylabel('Average Goodreads Rating')
plt.title('Average Goodreads Rating by Publication Year')
plt.show()


In [None]:
## Example 4: Bar Plots - Used for comparing categorical data
grouped_languages = df.groupby('orig_lang')['gr_avg_rating'].mean().reset_index()  # Group by original language and calculate the average rating
grouped_languages = grouped_languages.sort_values(by='gr_avg_rating', ascending=False)  # Sort by average rating

plt.figure(figsize=(12, 6))  # Set the figure size
plt.bar(grouped_languages['orig_lang'], grouped_languages['gr_avg_rating'], color='blue', alpha=0.7)
plt.xlabel('Original Language')
plt.ylabel('Average Goodreads Rating')
plt.title('Average Goodreads Rating by Original Language')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.tight_layout()  # Adjust layout to prevent overlap


**Try It Yourself**: Pick a column or a series of columns to explore. Try to visualize your selected data. Did you encounter any obstacles? Is the data in a form that you have to alter, manipulate, or filter? 

## Application: Text Analysis

Now let's put this all together. 
1. First, scroll through the publicly available options listed in the cell below. Pick one to fetch from Project Gutenberg. 
2. Using the `requests` library, we will download the entire text and save it as an object. 
3. Clean the text to remove the header and footer. 
4. Extract all the words in lowercase format. 
5. Examine the frequency of words via two different types of visualizations. 
6. Perform sentiment analysis on the text. 
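
Steps 3 to 5 can be previewed on a tiny made-up string before working with a full book. This is a sketch under stated assumptions: `raw_text` is a hypothetical stand-in for a downloaded file, and only the standard library is used. One design choice here is to skip past the rest of the start-marker line so that the marker itself is not counted as part of the book.

```python
import re
from collections import Counter

# Hypothetical stand-in for a downloaded book; the real text comes from Project Gutenberg
raw_text = (
    "Header notes\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "The cat sat. The cat ran!\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Footer notes"
)

# Step 3: keep only the text between the start and end markers
start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK"
end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"
start = raw_text.find(start_marker)
end = raw_text.find(end_marker)
# Skip past the rest of the start-marker line so it is not counted as book text
body_start = raw_text.find("\n", start) + 1
body = raw_text[body_start:end]

# Step 4: lowercase the text and pull out runs of letters
words = re.findall(r"\b[a-z]+\b", body.lower())

# Step 5: count how often each word appears
word_counts = Counter(words)
print(word_counts.most_common(2))
```

`str.find` returns `-1` when a marker is missing, so with a real download it is worth checking `start` and `end` before slicing.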

In [None]:
## Step 1: Pick a title to analyze. 
for title in Public['title']:
    print(title)

In [None]:
## Step 2: Pull the data for the book you picked. 

import requests ## library for making HTTP requests

url = Public[Public['title'] == ...]['pg_eng_url'].values[0]  ## Get the URL for your chosen book 
response = requests.get(url)  ## Make a GET request to the URL
if response.status_code == 200:  ## Check if the request was successful
    print("Successfully retrieved the book page.")
else:
    print(f"Failed to retrieve the book page. Status code: {response.status_code}")


In [None]:
response.text[:1000]  ## Show the first 1000 characters of the response text to see the content of the page

In [None]:
## Step 3: Clean the data to remove the header and footer from the response text 

## remove the header and footer from the response text 
start = response.text.find("*** START OF THE PROJECT GUTENBERG EBOOK")  ## Find the start of the book text
end = response.text.find("*** END OF THE PROJECT GUTENBERG EBOOK")  ## Find the end of the book text
cleaned_text = response.text[start:end]  ## Extract the book text between the start and end markers

cleaned_text

In [None]:
## Step 4: Process the cleaned text to extract words 

## now let's make everything lowercase and remove any punctuation 
import re ## library for regular expressions
words = re.findall(r'\b[a-z]+\b', cleaned_text.lower())

print(words[:100])  ## Print the first 100 words to see the cleaned text
print(len(words))

In [None]:
## Step 5: Analyze the frequency of words in the text 

## let's take a look at the most common words in the text 

from collections import Counter ## library for counting hashable objects 

word_counts = Counter(words)  ## count the number of times each word appears in the text
word_counts.most_common(10) 

This isn't interesting, though: those are all filler words! Is there a way to easily remove these words from our analysis so that we can see which words are really representative of this text? There are several libraries you can use to do so, such as nltk and spaCy. Here we're going to use nltk, which is a natural language processing library. 
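
The idea of stop-word filtering can be sketched without any extra library. In the example below, both the word counts and the three-word stop list are hypothetical; nltk's English list is much longer.

```python
from collections import Counter

# Hypothetical word counts and a tiny hand-written stop list (not nltk's list)
sample_counts = Counter({"the": 12, "and": 9, "wizard": 5, "of": 4, "road": 3})
small_stop_list = {"the", "and", "of"}

# Build a new Counter that keeps only words outside the stop list
filtered_counts = Counter({
    word: count
    for word, count in sample_counts.items()
    if word not in small_stop_list
})

print(filtered_counts.most_common(2))
```

Building a new `Counter` leaves the original counts untouched, which can be handy if you want to compare results before and after filtering.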

In [None]:

import nltk  ## library for natural language processing
from nltk.corpus import stopwords  ## library for stop words

nltk.download('stopwords')  ## Download the stop words list if you haven't already
stop_words = set(stopwords.words('english'))  ## Get the set of English stop words

for word in list(word_counts.keys()):
    if word in stop_words:  ## Check if the word is a stop word
        del word_counts[word]  ## Remove the stop word from the counts

word_counts.most_common(10)  ## Print the most common words after removing stop words

In [None]:
most_common_words = word_counts.most_common(10)  ## Get the most common words
top_words, top_counts = zip(*most_common_words)  ## Unzip the pairs into one tuple of words and one tuple of counts (new names, so the full `words` list is not overwritten)

plt.bar(top_words, top_counts, color='blue')
plt.xlabel('Words')
plt.ylabel('Count')
plt.title('Most Common Words')
plt.xticks(rotation=45)  ## Rotate x-axis labels for better readability
plt.tight_layout()  ## Adjust layout to prevent overlap

In [None]:
try:
    from wordcloud import WordCloud  ## library for generating word clouds
except ImportError:  ## wordcloud is not installed yet, so install it first
    %pip install wordcloud 
    from wordcloud import WordCloud


In [None]:
## Visualisation Number 2: Word Cloud 

wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_counts)  ## Generate a word cloud from the word counts
plt.figure(figsize=(10, 5))  ## Set the figure size
plt.imshow(wordcloud, interpolation='bilinear')  ## Display the word cloud
plt.axis('off')  ## Turn off the axis

## Optional: save the figure as an image file 
plt.savefig('name.png', bbox_inches='tight', dpi=300)  ## Save the figure as a PNG file
plt.show()  ## Show the word cloud

In [None]:
## Step 6: Perform sentiment analysis on the text 

nltk.download('vader_lexicon')  ## Download the VADER lexicon for sentiment analysis

from nltk.sentiment import SentimentIntensityAnalyzer  ## library for sentiment analysis

sia = SentimentIntensityAnalyzer()  ## Create a SentimentIntensityAnalyzer object


In [None]:

## example Sentence 
sentence = "The Wizard of Oz is a wonderful story."
sentiment_scores = sia.polarity_scores(sentence)  ## Get the sentiment scores for the sentence
print(sentiment_scores)  ## Print the sentiment scores for the sentence

In [None]:
## Let's extract the sentences for your book 

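nltk.download('punkt')  ## Download the sentence tokenizer data if you haven't already
nltk.download('punkt_tab')  ## Newer NLTK releases look for the tokenizer data under this name instead
sentences = nltk.sent_tokenize(cleaned_text)  ## split the cleaned text into sentences using NLTK's sentence tokenizer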
print(sentences[:5])  ## Print the first 5 sentences to see the tokenized sentences


In [None]:
def clean_sentence(sentence):
    """Clean a sentence by removing decorative symbols, collapsing whitespace, and dropping non-ASCII characters."""
    # Remove asterisks, underscores, and other decorative symbols
    cleaned_sentence = re.sub(r'[_*#<>+=\\/\[\]{}|]', '', sentence)


    # Replace multiple newlines or spaces with single space
    cleaned_sentence = re.sub(r'\s+', ' ', cleaned_sentence)

    # Remove weird characters (non-ASCII)
    cleaned_sentence = cleaned_sentence.encode('ascii', errors='ignore').decode()

    # Strip leading/trailing spaces
    cleaned_sentence = cleaned_sentence.strip()

    return cleaned_sentence

In [None]:
cleaned_sentences = []

for sentence in sentences:
    cleaned_sentences.append(clean_sentence(sentence))

In [None]:
cleaned_sentences[:100]

In [None]:
results = []

for sentence in cleaned_sentences:
    sentiment_scores = sia.polarity_scores(sentence)  ## Get the sentiment scores for the sentence
    results.append((sentence, sentiment_scores['compound']))  ## Append the sentiment scores to the results list

In [None]:
## visualize the sentiment scores 

sentiment_df = pd.DataFrame(results, columns=['Sentence', 'Score'])  ## use a new name so the novels DataFrame `df` is not overwritten

plt.plot(sentiment_df['Score'])

In [None]:
## This makes it hard to see any overall trends, so let's calculate the average sentiment score for every 50 sentences. 

## create a dictionary to store the averages 
avgScores = {}

## create counter and avgValue variables to keep track 
counter = 0 
avgValue = 0

## loop through the results 
for i in range(len(results)):
    avgValue += results[i][1]  ## add the sentiment score to the avgValue
    counter += 1  ## increment the counter

    ## when we're ready to go to the next bin, we first calculate the average for the current bin, and reset the counter and avgValue 
    if counter == 50:
        avgScores[i // 50] = avgValue / 50
        counter = 0
        avgValue = 0

plt.plot(list(avgScores.keys()), list(avgScores.values()), marker='o', linestyle='-', color='blue', alpha=0.5)
plt.xlabel('Sentence Bins (Every 50 Sentences)')
plt.ylabel('Average Sentiment Score')
plt.title('Average Sentiment Score of "The Wizard of Oz" by Sentence Bins')
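
The same binning idea can also be written with list slicing. This is a sketch with hypothetical scores and a small bin size so the result is easy to check by hand. Unlike the counter-based loop, it also averages the final, shorter bin instead of dropping it.

```python
# Hypothetical compound scores, one per sentence
scores = [0.5, -0.1, 0.3, 0.0, 0.4, -0.6, 0.2]
bin_size = 3   # the tutorial uses 50; a small value keeps the example readable

bin_averages = []
for start in range(0, len(scores), bin_size):
    chunk = scores[start:start + bin_size]          # the last chunk may be shorter
    bin_averages.append(sum(chunk) / len(chunk))    # divide by the actual chunk length

print(bin_averages)
```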


### Congratulations, you've finished the Introduction to Python tutorial!  

If you're interested in working through some other text analysis examples or challenges: 

1. Create a heatmap of word frequencies by chapter. 
2. Test whether there's a difference in sentiment for shorter sentences versus longer sentences. What built-in function could you use to calculate the length of the sentence?  How might you modify your existing code to add sentence length as a variable?
3. Try topic modeling to identify what words cluster together using scikit-learn. 

You can also check out the numerous Constellate tutorials (available for a limited time!), including but not limited to:
- Significant Terms Analysis
- Data Visualization 
- Building a language model
- Sentiment Analysis 