# Basic data structures and operations in python

![](python_logo.png)

Like all programming languages, python performs <i>operations</i> on <i>data</i>. Any use of python requires familiarity with the containers that store data and the operations native to the language. This class looks at the data structures and operations in base python, that is, python when it is not augmented by specialist libraries. We will learn about the data structures and operations of specialist NLP libraries in due course.
* Data types in python can be <b>mutable</b> or <b>immutable</b>. Mutable data types can be changed after creation (i.e. items can be added or removed) and include lists (`list`), dictionaries (`dict`), and sets (`set`). Immutable data types like strings (`str`), tuples (`tuple`), and numbers (`int`, `float`, `complex`) cannot be changed after creation; altering them requires the creation of a new object.
* <b>Operations</b> in python are used to manipulate data. Examples include <b>iteration</b> (`for` and `while` loops; `list comprehensions`), <b>conditional statements</b> (`if`, `elif`, `else`), <b>logical operators</b> (`and`, `or`, `not`), and <b>functions</b> (`def`).
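
The difference between mutable and immutable types can be seen in a minimal sketch. The variable names here are made up for illustration. A list keeps its identity when it is changed, whereas a string method such as `upper()` hands back a new string and leaves the original untouched.

```python
# A list is mutable: append() changes the same object in place
colours = ["red", "green"]
list_id_before = id(colours)
colours.append("blue")
same_list = id(colours) == list_id_before  # still the same object

# A string is immutable: upper() returns a new string and leaves the original alone
word = "python"
shouted = word.upper()

print("List after append:", colours)
print("Same list object?", same_list)
print("Original string:", word)
print("New string:", shouted)
```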

## Data types

## Lists

The `list` is one of the most versatile data types in python. It is created using square brackets. Essentially it is an ordered container for almost any type of data. The full functionality of the `list` and all the other data structures below [can be found here](https://docs.python.org/3/tutorial/datastructures.html). Some of the more common `list` operations can be found below.

In [None]:
# Create a list
disciplines = ["history", "psychology", "physics", "art"]

# Add an item to the list
disciplines.append("NLP")

# Remove an item from the list
disciplines.remove("physics")

# Print the final list
print("Updated discipline list", disciplines)

# Find a specific item in a list by index
print("The second item in the list is", disciplines[1])

# Find the last item in a list by index
print("The last item in the list is", disciplines[-1])

# Find a range of items in a list by index
print("The second and third items in the list are", disciplines[1:3])
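
A few other list operations are worth knowing. The sketch below uses a made-up list of fruit (not part of the class data) to show the difference between `sorted()`, which returns a new list, and `.sort()`, which reorders the existing list in place.

```python
# Hypothetical example list, used only for illustration
fruits = ["pear", "apple", "mango", "cherry"]

# sorted() returns a new, sorted list and leaves the original unchanged
fruits_sorted = sorted(fruits)

# Check whether an item is in the list, and count the items
has_mango = "mango" in fruits
n_fruits = len(fruits)

# .sort() reorders the list itself; reverse=True sorts from Z to A
fruits.sort(reverse=True)

print("New sorted list:", fruits_sorted)
print("Original list after .sort(reverse=True):", fruits)
print("Contains mango?", has_mango, "| Number of items:", n_fruits)
```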

#### Problem: Create a `list` of all London boroughs and sort it alphabetically

## Sets

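A `set` is an unordered collection of data that contains no repeats. Sets are especially useful for checking whether a given item is present in a large dataset: python stores sets as hash tables, so a membership test takes roughly the same time however large the set is, whereas a list has to be searched item by item. For example, checking if the word "apple" is present in a novel is much quicker when the words in the novel are represented as a set. Sets are created using curly braces (an empty set is created with `set()`, because `{}` creates an empty dictionary).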

In [None]:
words_as_list = ["cat", "rat", "bat", "rat", "rat", "rat", "rat", "rat", "bat", "bat", "bat", "bat", "bat", "bat"]
words_as_set = {"cat", "rat", "bat", "rat", "rat", "rat", "rat", "rat", "bat", "bat", "bat", "bat", "bat", "bat"}

print("List of words:", words_as_list)
print("Set of words:", words_as_set)
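
In practice a set is often built from an existing list with `set()`. The following is a small sketch with invented variable names: it removes duplicates, tests membership with `in`, and combines two sets.

```python
# Hypothetical data for illustration
animal_list = ["cat", "rat", "bat", "rat", "bat", "cat"]

# Converting a list to a set drops the duplicates
animal_set = set(animal_list)
print("Items in list:", len(animal_list), "| Items in set:", len(animal_set))

# Membership tests use the `in` operator
print("Is 'bat' in the set?", "bat" in animal_set)
print("Is 'dog' in the set?", "dog" in animal_set)

# Sets support intersection (&) and union (|)
other_animals = {"bat", "dog"}
common = animal_set & other_animals
combined = animal_set | other_animals
print("In both sets:", common)
print("In either set:", combined)
```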

## Dictionaries

A dictionary or `dict` stores information in key-value pairs so that the value can be accessed using the key. Unlike lists, dictionaries are accessed by key rather than by position (since python 3.7 they do preserve the order in which items were inserted). A dictionary is created using curly braces surrounding `key:value` pairs.

In [None]:
# Create a dictionary
grades = {"Numail": 85, "Sean": 90, "Alice": 78}

# Update Sean's grade
grades["Sean"] = 95

# Add a new student
grades["Diana"] = 88

# Print the dictionary
print("Student Grades:", grades)
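
Some other common dictionary operations are sketched below with a separate, made-up dictionary. `.get()` looks up a key without raising an error if it is missing, `.items()` lets you loop over keys and values together, and `.pop()` removes a key and returns its value.

```python
# Hypothetical scores, used only for illustration
scores = {"Ana": 72, "Ben": 64}

# .get() returns a fallback value instead of raising a KeyError
missing = scores.get("Cleo", "no score recorded")
print("Cleo:", missing)

# Loop over the key-value pairs
for name, score in scores.items():
    print(name, "scored", score)

# Remove a key and keep the value that was stored under it
removed = scores.pop("Ben")
print("Removed score:", removed)
print("Remaining scores:", scores)
```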

#### Problem: Create a `dict` of all London boroughs and their populations

## Tuples

Tuples are like lists in that they are ordered and can contain multiple data types. However, they are immutable: they cannot be changed after creation. Typically, tuples are used for data that should not change or when there is a need for computational efficiency, as they use less memory than lists. Tuples are created using round brackets.

In [None]:
# Define a tuple
coordinate = (10, 20)

# Access elements
x, y = coordinate
print("X:", x, "Y:", y)
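
To see immutability in action, the sketch below (with invented names) tries to overwrite one element of a tuple. Python raises a `TypeError`, so the usual approach is to build a new tuple instead. The memory comparison at the end uses `sys.getsizeof`; the exact numbers depend on the python version, so treat them as indicative only.

```python
import sys

point = (10, 20)

# Trying to change an element of a tuple raises a TypeError
try:
    point[0] = 99
except TypeError as error:
    message = str(error)
    print("Cannot modify a tuple:", message)

# To 'change' a tuple, build a new one from the old values
moved_point = (point[0] + 5, point[1])
print("Original:", point, "| New:", moved_point)

# Compare the size in bytes of a tuple and a list holding the same items
print("Tuple size:", sys.getsizeof((1, 2, 3)))
print("List size:", sys.getsizeof([1, 2, 3]))
```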

#### Problem: Find the area of the rectangle with the vertices below:

In [None]:
a = (1,3)
b = (-1, 3)
c = (-1, -6)
d = (1, -6)


## Strings

Strings are the form that text data takes, and are therefore the most important data type for NLP. We will cover them at length in the next class. For completeness, here is an example of string manipulation.

In [None]:
# Define a string
sentence = "The sun shone, having no alternative, on the nothing new."

# Convert to uppercase
uppercase_sentence = sentence.upper()

# Replace a word
modified_sentence = sentence.replace("sun", "moon")

print("Uppercase:", uppercase_sentence)
print("Modified:", modified_sentence)

## Operations

### The `for` loop 

The `for` loop is one of the most common iterative operations in python. It typically works by performing an operation on each item in a data structure and storing the result.

In [None]:
# Create an empty list
cap_letters = []

sentence = "It was a bright cold day in April, and the clocks were striking thirteen"

# Iterate over the characters of the sentence (including spaces and punctuation) and append the capitalised version of each to the empty list
for i in sentence:
    cap_letters.append(i.upper())
print("Capital letters:", cap_letters)
    
# Squares of numbers from 1 to 5
result = []
for i in range(1, 6):
    result.append(i ** 2)
print("Squares:", result)
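
The introduction also mentions the `while` loop, which repeats for as long as a condition holds rather than once per item. A minimal sketch with made-up variable names: keep doubling a number until it reaches at least 100, counting the steps along the way.

```python
value = 1
steps = 0

# The loop body runs again and again until the condition becomes False
while value < 100:
    value *= 2   # double the value
    steps += 1   # count how many times we have doubled

print("Final value:", value)
print("Number of doublings:", steps)
```

A `for` loop is usually the better choice when you know in advance what you are iterating over; a `while` loop suits cases where you only know the stopping condition.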


### The `list` comprehension

The `list comprehension` is a more compact way of writing a `for` loop that builds a list, and it usually runs somewhat faster. It is generally better to use list comprehensions when working with large datasets, though they can make your code harder to follow.

In [None]:
low_letters = [i.lower() for i in sentence]
squares = [i**2 for i in range(1,6)]

print("Lowercase letters:", low_letters)
print("Squares:", squares)
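
A list comprehension can also include a condition, which filters the items as it goes. The sketch below uses a made-up string and shows the comprehension next to the equivalent `for` loop, so you can check that they produce the same result.

```python
quote = "It was a bright cold day in April"
words_in_quote = quote.split()  # split the string into words on whitespace

# Keep only the words longer than three letters, in uppercase
long_words = [w.upper() for w in words_in_quote if len(w) > 3]

# The same thing written as an ordinary for loop
long_words_loop = []
for w in words_in_quote:
    if len(w) > 3:
        long_words_loop.append(w.upper())

print("Comprehension:", long_words)
print("For loop:", long_words_loop)
```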

### The `if-else` statement

The `if-else` statement is used inside a loop to select data from a data structure based on a condition and perform an operation on it. Logical operators like `and`, `or`, and `not` can be used to make complex selections.

In [None]:
numbers_1 = [1, 2, 3, 4, 5, 6]

even_1 = []
odd_1 = []

for i in numbers_1:
    if i % 2 == 0:
        even_1.append(i)
        print(f"{i} is even")
    else:
        odd_1.append(i)
        print(f"{i} is odd")

print("Even list:", even_1)
print("Odd list:", odd_1)
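
When there are more than two possibilities, `elif` (short for "else if") adds further conditions between `if` and `else`. The conditions are checked from top to bottom and only the first one that is true is acted on. A small sketch with made-up numbers:

```python
readings = [-3, 0, 7, 12]
labels = []

for n in readings:
    if n < 0:
        labels.append("negative")
    elif n == 0:
        labels.append("zero")
    elif n < 10:
        labels.append("small positive")
    else:
        # Reached only if none of the conditions above were true
        labels.append("large positive")

print("Labels:", labels)
```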


In [None]:
numbers_2 = [1, 2, 3, 4, 5, 6, 10, 13, 16, 19, 27]

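selected_2 = []
rejected_2 = []

for i in numbers_2:
    # `and` binds more tightly than `or`, so the brackets only make the grouping explicit
    if (i % 2 == 0 and i > 10) or i == 2: # Impose some extra logical conditions
        selected_2.append(i)
        print(f"{i} meets the condition")
    else:
        rejected_2.append(i)
        print(f"{i} does not meet the condition")

print("Numbers that are even and greater than 10, or equal to 2:", selected_2)
print("All other numbers:", rejected_2)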




#### Problem: Create a list of the first names of everyone in the class. Use a `for` loop with an `if-else` statement to create a list that contains the names with an even number of letters and a list for names with an odd number of letters.

## Functions

Functions allow you to 'package' a particular piece of complex code so you can re-use it efficiently. The <b>argument</b> of a function is the data that the function operates on. Functions are created using the `def` keyword, which is short for <i>define</i>.

In [None]:
# Make some data

words = [
    'mirror', 'penguin', 'flame', 'apple', 'candle', 'quilt', 'rose', 'egg', 'jewel', 'queen',
    'nest', 'kite', 'dragon', 'house', 'owl', 'kite', 'olive', 'glove', 'banana', 'engine',
    'amber', 'rabbit', 'yield', 'jungle', 'rose', 'lamp', 'egg', 'olive', 'mirror', 'union',
    'jewel', 'xylophone', 'dog', 'ant', 'kettle', 'vase', 'quiet', 'candle', 'glove', 'ocean',
    'whale', 'king', 'night', 'yarn', 'stone', 'xylophone', 'cat', 'tree', 'net', 'ocean',
    'whale', 'dragon', 'air', 'yacht', 'train', 'iron', 'monkey', 'quiet', 'lamp', 'paint',
    'violet', 'hill', 'cat', 'candle', 'hat', 'amber', 'umbrella', 'rabbit', 'jacket', 'sun',
    'heart', 'vase', 'crane', 'night', 'grape', 'jungle', 'olive', 'nest', 'net', 'quiet',
    'quilt', 'tree', 'violet', 'heart', 'eagle', 'candle', 'rabbit', 'ant', 'flame', 'peach',
    'ant', 'turtle', 'river', 'flame', 'banana', 'engine', 'fish', 'victory', 'uniform', 'uniform'
]

# Create a function that returns a list of the words that contain the letter 'u'

def find_u(word_list):
    u_words = []
    for i in word_list:
        if 'u' in i:
            u_words.append(i)
    return u_words
    
find_u(words)

In [None]:
find_u(['funny', 'sad'])
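
A function can take more than one argument, and an argument can be given a default value. As a sketch, the hypothetical `find_letter` function below generalises `find_u` so that the letter to search for is itself an argument; it is one possible way of writing this, not the only one.

```python
def find_letter(word_list, letter="u"):
    """Return the words in word_list that contain the given letter."""
    return [word for word in word_list if letter in word]

sample = ["funny", "sad", "queen", "dog"]

# With no second argument, the default letter 'u' is used
with_u = find_letter(sample)

# The default can be overridden by passing a different letter
with_d = find_letter(sample, letter="d")

print("Words containing 'u':", with_u)
print("Words containing 'd':", with_d)
```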

# Refresher on `pandas`

![](Pandas.png)

There are many data science libraries in python, but `pandas` has emerged as the default for most purposes. It is an exceptionally versatile library that can be used to manipulate data of many different kinds. It is impossible to cover all the functionality of `pandas`, but some of the most important operations are covered below. The data we use comes from a Twitter dataset.

In [None]:
# Import pandas and shorten the name for convenience
# Import seaborn, a plotting library
import pandas as pd 
import seaborn as sns
sns.set()

### Viewing data

In [None]:
# Create a dataframe by importing our Twitter data.

df = pd.read_csv("twitter_gender.csv", index_col = 0, encoding = "latin-1")


In [None]:
# View the entire dataframe
df 

In [None]:
# See the column names
df.columns

In [None]:
# View a specific row

df.iloc[50]

In [None]:
# View the top n rows
df.head(n = 10)

In [None]:
# View the last n rows
df.tail(n = 10)

In [None]:
# View a selection of rows
df.iloc[34:50]

In [None]:
# Create a new dataframe with fewer rows
df_1 = df.iloc[34:50]
df_1

In [None]:
# View a column
df['text']

In [None]:
# Count the frequencies of column items
df['gender'].value_counts()

In [None]:
# Create a new dataframe with fewer columns
df_cols = df[['gender', 'text', 'retweet_count']]
df_cols

In [None]:
# Create a dataframe for a specific category in a column (here, all the tweets where gender is 'male')

male = df[df['gender'] == 'male']
male

In [None]:
# Get the mean of a specific variable for each category in a column (also works for median, sum, standard deviation and other descriptive statistics).
df.groupby('gender')['retweet_count'].mean()
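
Several descriptive statistics can be computed in one go by passing a list of names to `.agg()`. The sketch below uses a tiny made-up dataframe (the column names `group` and `value` are invented and the numbers are not from the Twitter dataset) so that the result is easy to check by hand.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1, 3, 2, 4, 6],
})

# One row per group, one column per statistic
summary = toy.groupby("group")["value"].agg(["mean", "median", "sum"])
print(summary)
```

The same pattern should carry over to the class data by swapping in `df`, `'gender'` and `'retweet_count'`.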

In [None]:
# Plot the differences between categories
sns.pointplot(x = 'gender' , y = 'retweet_count', data = df)

In [None]:
# Get the index of the max or min of a given variable

most = df['retweet_count'].idxmax()
least = df['retweet_count'].idxmin()

print("The tweet with the most retweets is:", df.loc[most, 'text'])
print("The tweet with the fewest retweets is:", df.loc[least, 'text'])