## Pandas

A library for working with tabular data (dataframes)

In [None]:
import pandas as pd

#### Creating a dataframe

In [None]:
data = {
    "name": ["Анна", "Мария", "Софья", "Сидни", "Ксения", "Наталья", "Татьяна", "Александра", "Иван", "Аиша", "Александр", "Анна", "Рахмат", "Алексей", "Анна", "Анна", "Ваньшу"],
    "balloons": [25, 14, 34, 78, 16, 50, 27, 38, 43, 17, 34, 7, 89, 56, 13, 12, 50]
}

df = pd.DataFrame(data)
df

Unnamed: 0,name,balloons
0,Анна,25
1,Мария,14
2,Софья,34
3,Сидни,78
4,Ксения,16
5,Наталья,50
6,Татьяна,27
7,Александра,38
8,Иван,43
9,Аиша,17


In [None]:
data = [
    {"name": "Анна", "balloons": 25},
    {"name": "Сидни", "balloons": 14},
    {"name": "Наталья", "balloons": 34},
    {"name": "Александр", "balloons": 78}
]

df = pd.DataFrame(data)
df

Unnamed: 0,name,balloons
0,Анна,25
1,Сидни,14
2,Наталья,34
3,Александр,78


In [None]:
data = [["Рахмат", 25], ["Ваньшу", 14], ["Ксения", 34], ["Аиша", 78]]

df = pd.DataFrame(data, columns=["name", "balloons"])
df

Unnamed: 0,name,balloons
0,Рахмат,25
1,Ваньшу,14
2,Ксения,34
3,Аиша,78


#### Reading the data

In [None]:
df = pd.read_csv("data/example.csv", sep="\t")  # the columns are separated by tabs
df

In [None]:
df = pd.read_excel("data/example.xlsx", sheet_name="example")
df

In [None]:
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population")[0]
df.head()

Unnamed: 0.1,Unnamed: 0,Country / Dependency,Population,% of world,Date,Source (official or from the United Nations),Unnamed: 6
0,–,World,8065110000,100%,12 Oct 2023,UN projection[3],
1,1,China,1411750000,,31 Dec 2022,Official estimate[4],[b]
2,2,India,1392329000,,1 Mar 2023,Official projection[5],[c]
3,3,United States,335499000,,12 Oct 2023,National population clock[7],[d]
4,4,Indonesia,278696200,,1 Jul 2023,National annual projection[8],


In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/cldf-datasets/wals/master/raw/country.csv")

In [None]:
df.head()

Unnamed: 0,pk,jsondata,id,name,description,markup_description,continent
0,1,,AF,Afghanistan,,,Asia
1,2,,AL,Albania,,,Europe
2,3,,DZ,Algeria,,,Africa
3,4,,AS,American Samoa,,,Australia & Oceania
4,5,,AO,Angola,,,Africa


#### Saving the data

In [None]:
df.to_csv("wals_country.csv", index=False)  # do not write the row index as an extra column
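
A minimal sketch of what the `index` argument changes when saving. The toy dataframe and the in-memory buffers (`io.StringIO`) are assumptions made for illustration, so that nothing is written to disk:

```python
import io
import pandas as pd

# Toy data, not the WALS table
toy = pd.DataFrame({"id": ["AF", "AL"], "name": ["Afghanistan", "Albania"]})

# Write to in-memory buffers instead of files
with_index = io.StringIO()
toy.to_csv(with_index)                  # default: the row index becomes an unnamed first column
without_index = io.StringIO()
toy.to_csv(without_index, index=False)  # only the data columns are written

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)     # ,id,name
print(header_without)  # id,name

# Reading the index-free CSV back restores the same columns and values
restored = pd.read_csv(io.StringIO(without_index.getvalue()))
print(restored)
```

The unnamed first column in the default output is what later shows up as `Unnamed: 0` when such a file is read back with `pd.read_csv`.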

#### Manipulating the data

In [None]:
df = df.drop(["jsondata", "description"], axis=1)  # delete two columns

# check the result
df.head(2)

Unnamed: 0,pk,id,name,markup_description,continent
0,1,AF,Afghanistan,,Asia
1,2,AL,Albania,,Europe


In [None]:
df.dropna()  # drop rows with any missing value; markup_description is always empty, so no rows are left

Unnamed: 0,pk,id,name,markup_description,continent


In [None]:
df.shape, df.dropna().shape  # shape of the dataframe as (rows, columns), before and after dropna

((192, 5), (0, 5))

In [None]:
df.dropna(how="all", axis=1).head()  # drop only the columns in which every value is missing

Unnamed: 0,pk,id,name,continent
0,1,AF,Afghanistan,Asia
1,2,AL,Albania,Europe
2,3,DZ,Algeria,Africa
3,4,AS,American Samoa,Australia & Oceania
4,5,AO,Angola,Africa
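
The `dropna` variants used above can be compared side by side on a small example. This is only a sketch: the toy dataframe is made up, and the `subset` argument is an addition that does not appear in the cells above.

```python
import numpy as np
import pandas as pd

# Toy data: "note" has no values at all, "id" is missing in one row
toy = pd.DataFrame({
    "id": ["AF", "AL", None],
    "note": [np.nan, np.nan, np.nan],
    "continent": ["Asia", "Europe", "Africa"],
})

rows_any = toy.dropna()                    # drops every row, because "note" is always missing
cols_all = toy.dropna(how="all", axis=1)   # drops only the fully empty "note" column
rows_subset = toy.dropna(subset=["id"])    # drops rows where "id" specifically is missing

print(rows_any.shape, cols_all.shape, rows_subset.shape)
```

None of these calls changes `toy` itself; each one returns a new dataframe.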


In [None]:
df = df[["pk", "id", "name", "continent"]] # selecting columns

In [None]:
df.head()

Unnamed: 0,pk,id,name,continent
0,1,AF,Afghanistan,Asia
1,2,AL,Albania,Europe
2,3,DZ,Algeria,Africa
3,4,AS,American Samoa,Australia & Oceania
4,5,AO,Angola,Africa


In [None]:
df[df["pk"] < 4]  # select the rows where the value in the pk column is less than 4

Unnamed: 0,pk,id,name,continent
0,1,AF,Afghanistan,Asia
1,2,AL,Albania,Europe
2,3,DZ,Algeria,Africa
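
Row filters like the one above can also be combined. The following sketch uses a made-up toy dataframe; the `&` operator and `isin` are standard pandas, but they are additions to what the cells here show.

```python
import pandas as pd

# Toy data, modelled on the first rows of the table above
toy = pd.DataFrame({
    "pk": [1, 2, 3, 4],
    "name": ["Afghanistan", "Albania", "Algeria", "American Samoa"],
    "continent": ["Asia", "Europe", "Africa", "Australia & Oceania"],
})

# Each comparison yields a boolean Series; wrap each one in parentheses before combining with &
small_non_asia = toy[(toy["pk"] < 4) & (toy["continent"] != "Asia")]

# isin keeps the rows whose value is in the given list
asia_or_africa = toy[toy["continent"].isin(["Asia", "Africa"])]

print(small_non_asia)
print(asia_or_africa)
```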


In [None]:
df["continent"].value_counts() # counting the values

Africa                 54
Asia                   46
Europe                 38
Australia & Oceania    22
North America          17
South America          13
Eurasia                 2
Name: continent, dtype: int64

In [None]:
df[df["continent"] == "North America"]

Unnamed: 0,pk,id,name,continent
14,15,BZ,Belize,North America
26,27,CA,Canada,North America
34,35,CR,Costa Rica,North America
42,43,DM,Dominica,North America
46,47,SV,El Salvador,North America
63,64,GD,Grenada,North America
66,67,GT,Guatemala,North America
70,71,HT,Haiti,North America
71,72,HN,Honduras,North America
81,82,JM,Jamaica,North America


In [None]:
df[df["continent"] == "North America"].sort_values(by="id")

Unnamed: 0,pk,id,name,continent
120,121,AN,Netherlands Antilles,North America
14,15,BZ,Belize,North America
26,27,CA,Canada,North America
34,35,CR,Costa Rica,North America
42,43,DM,Dominica,North America
63,64,GD,Grenada,North America
66,67,GT,Guatemala,North America
71,72,HN,Honduras,North America
70,71,HT,Haiti,North America
81,82,JM,Jamaica,North America


In [None]:
df.groupby("continent").agg({"name": list}) # grouping by continent and making lists of countries

Unnamed: 0_level_0,name
continent,Unnamed: 1_level_1
Africa,"[Algeria, Angola, Benin, Botswana, Burkina Fas..."
Asia,"[Afghanistan, Armenia, Azerbaijan, Bahrain, Ba..."
Australia & Oceania,"[American Samoa, Australia, Fiji, French Polyn..."
Eurasia,"[Russia, Turkey]"
Europe,"[Albania, Austria, Belarus, Belgium, Bosnia-He..."
North America,"[Belize, Canada, Costa Rica, Dominica, El Salv..."
South America,"[Argentina, Bolivia, Brazil, Chile, Colombia, ..."


In [None]:
df.groupby("continent").agg({"pk": "mean"})  # computing the mean of pk within each continent

Unnamed: 0_level_0,pk
continent,Unnamed: 1_level_1
Africa,94.703704
Asia,104.195652
Australia & Oceania,113.909091
Eurasia,159.5
Europe,85.921053
North America,88.235294
South America,79.307692
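
Several aggregations can be computed in one `groupby` call. This is a hedged sketch on made-up toy data; it uses the named-aggregation form of `agg`, and the output column names `n_countries` and `first_country` are invented for the example.

```python
import pandas as pd

# Toy data, not the full WALS table
toy = pd.DataFrame({
    "name": ["Algeria", "Angola", "Albania", "Austria", "Belgium"],
    "continent": ["Africa", "Africa", "Europe", "Europe", "Europe"],
})

# Named aggregation: new_column=(source_column, aggregation)
summary = toy.groupby("continent").agg(
    n_countries=("name", "count"),
    first_country=("name", "first"),
)

print(summary)
```

The result is indexed by continent, so a single value can be looked up with `summary.loc["Africa", "n_countries"]`.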


We can apply a function to every value in a column, for example to lowercase and double the text in the id column:

In [None]:
def change_id(text):
    # Non-string values (for example, missing values) are returned unchanged
    if isinstance(text, str):
        text = text.lower() * 2
    return text

In [None]:
df["id"].apply(change_id)

0      afaf
1      alal
2      dzdz
3      asas
4      aoao
       ... 
187    wfwf
188    yeye
189    zmzm
190    zwzw
191    ssss
Name: id, Length: 192, dtype: object
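
For simple text operations, the built-in `.str` methods are an alternative to `apply`. The sketch below assumes a small made-up Series with one missing value and compares the two approaches; `str.lower` and `str.repeat` are standard pandas string methods.

```python
import pandas as pd

# Toy data with one missing value
ids = pd.Series(["AF", "AL", None], name="id")

def change_id(text):
    # Same idea as above: only strings are transformed
    if isinstance(text, str):
        text = text.lower() * 2
    return text

via_apply = ids.apply(change_id)
via_str = ids.str.lower().str.repeat(2)  # vectorized; missing values stay missing

print(via_apply.tolist())
print(via_str.tolist())
```

Both versions leave the missing value in place, so no explicit type check is needed in the `.str` variant.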

Let's work through the following sections of [the pandas tutorial](https://www.w3schools.com/python/pandas/default.asp):

+ Cleaning data
+ Correlations
+ Plotting

#### Practice

A corpus of utterances from Disney princess movies, created by Lelia Glass

In [None]:
df = pd.read_csv("princess_corpus.csv", sep=",")  # the separator is a comma (the default)
df.head(10)

Unnamed: 0,Disney_Period,Text,Speaker_Status,Movie,Speaker,Year,UTTERANCE_NUMBER
0,EARLY,slave in the magic mirror come from the farthe...,NON-P,Snow White,queen,1937,1
1,EARLY,"what wouldst thou know, my queen ?",NON-P,Snow White,mirror,1937,2
2,EARLY,"magic mirror on the wall, who is the fairest o...",NON-P,Snow White,queen,1937,3
3,EARLY,"famed is thy beauty, majesty. but hold, a love...",NON-P,Snow White,mirror,1937,4
4,EARLY,alas for her ! reveal her name.,NON-P,Snow White,queen,1937,5
5,EARLY,lips red as the rose. hair black as ebony. ski...,NON-P,Snow White,mirror,1937,6
6,EARLY,snow white !,NON-P,Snow White,queen,1937,7
7,EARLY,want to know a secret ? promise not to tell ? ...,PRINCESS,Snow White,snow white,1937,8
8,EARLY,today,PRINCE,Snow White,prince,1937,9
9,EARLY,oh !,PRINCESS,Snow White,snow white,1937,10


Show the columns of the dataframe (df.columns):

Select the columns ['Disney_Period', 'Text', 'Speaker_Status', 'Movie', 'Speaker', 'Year']:

Count the number of utterances per period:

Count the number of utterances per movie:

Print the movies grouped by period:

Apply a tokenization function to the text of the utterances:

Count the number of tokens per utterance:

Compute the mean utterance length in tokens, grouped by speaker status and Disney period:

Plot the results: