In [2]:
import pandas as pd # A general purpose Python library for data analysis
import numpy as np # A library for scientific computing in Python (e.g., provides high-performance multi-dimensional array objects and operations)

import matplotlib.pyplot as plt # a plotting library for Python and NumPy (readily customizable)
import seaborn as sns # Another plotting library for Python (less syntax, excellent default themes; it uses matplotlib behind the scenes)
import time


from google.colab import drive
drive.mount('/content/drive')

df1 = pd.read_csv('/content/drive/My Drive/Data for Pandas/elections.csv')
df2 = pd.read_csv('/content/drive/My Drive/Data for Pandas/duplicate_columns.csv')
df3 = pd.read_csv('/content/drive/My Drive/Data for Pandas/mottos.csv')


Mounted at /content/drive


## Knowledge Streams 2024

In this notebook, we will learn about the key data structures provided by the Pandas library: **Data Frames, Series, and Indices**.

In addition, we will learn about the following operations:
* How to access data contained in these structures?
* How to read files (e.g., csv, xlsx, sql) to create these structures?
* How to carry out different data manipulation tasks using these structures?

`Dataset`: US elections with information about candidates, their party, votes won, year of election and the result.

## Reading in Data Frames from Files
We'll be using **read_csv** today. Note that this file reading function does all the *data parsing* for you, which is very useful.

Before loading a file into a DataFrame, let's first take a look at the **elections.csv** file.

In [None]:
# Print the shape of the elections DataFrame (loaded above as df1).
shape = df1.shape
print(shape)
# shape is (rows, columns): rows are observations, columns are features.
observations = df1.shape[0]
features = df1.shape[1]
print(observations, features)

In [None]:
# head(n) shows the first n rows of a DataFrame.
head = df1.head(10)
print(head)
# tail(n) shows the last n rows.
tail = df1.tail(10)
print(tail)

In [None]:
# The `read_csv` command lets us specify a column to use as the index. For example, we could have used Year as the index.
new_index = pd.read_csv('/content/drive/My Drive/Data for Pandas/elections.csv', index_col="Year")
print(new_index)
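
As a minimal sketch of what `index_col` changes, the example below reads the same CSV text twice. The in-memory `csv_text` is a hypothetical stand-in for elections.csv (three rows copied from the output shown later in this notebook), so it runs without Google Drive.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for elections.csv.
csv_text = """Year,Candidate,Party,Result
1824,Andrew Jackson,Democratic-Republican,loss
1824,John Quincy Adams,Democratic-Republican,win
1828,Andrew Jackson,Democratic,win
"""

default_index = pd.read_csv(io.StringIO(csv_text))                 # default integer index
year_index = pd.read_csv(io.StringIO(csv_text), index_col="Year")  # Year becomes the index

print(default_index.index)
print(year_index.index)
print(year_index.columns.tolist())  # Year is no longer a regular column
```

Note that the index labels are allowed to repeat: 1824 appears twice here.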

In [None]:
# Alternatively, we could have used the set_index command on the DataFrame to set a particular column as the index.
df1.set_index("Year")


# Caution:
The **set_index command** (like most other DataFrame methods) **does not modify the DataFrame** it is called on; it returns a new one, so the original elections DataFrame (`df1` here) is untouched. Note: There is a flag called "inplace" which does modify the calling DataFrame (e.g., `df1.set_index("Party", inplace=True)`).
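
The caution above can be checked directly. This is a small sketch on a hand-built three-row frame; the name `elections_sample` is illustrative and not part of the notebook's data.

```python
import pandas as pd

# Illustrative three-row frame (values copied from the elections output in this notebook).
elections_sample = pd.DataFrame({
    "Year": [1824, 1824, 1828],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic"],
})

by_year = elections_sample.set_index("Year")          # returns a new DataFrame
original_columns = elections_sample.columns.tolist()  # the original is unchanged
print(original_columns)
print(by_year.index.name)

# To keep the change, assign the result back to a variable.
elections_sample = elections_sample.set_index("Year")
print(elections_sample.columns.tolist())
```

Reassigning the result is a common alternative to `inplace=True`.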

## Duplicate Columns?
By contrast with index labels, which may repeat, column names are expected to be unique. If we read in a file whose column names are not unique, Pandas will automatically rename any duplicates (a repeated name `x` becomes `x.1`). Load duplicate_columns.csv to see this.

In [None]:
df2 = pd.read_csv('/content/drive/My Drive/Data for Pandas/duplicate_columns.csv')
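
To see the renaming without the Drive file, here is a small sketch that reads a made-up CSV string with a repeated header. The column names and values are invented for illustration and are not the contents of duplicate_columns.csv.

```python
import io
import pandas as pd

# Made-up CSV text in which the header "name" appears twice.
csv_text = "name,flavor,name\nApple,sweet,Gala\nLemon,sour,Eureka\n"

dupes = pd.read_csv(io.StringIO(csv_text))
print(dupes.columns.tolist())  # the second "name" is renamed by read_csv
```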

## The [ ] Operator & Indexing

The DataFrame class has an indexing operator **[ ]** (also known as the 'bracket' operator) that lets you do a variety of different things. If you provide a string to the **[ ]** operator, you get back a ***Series*** corresponding to the requested label.

1. Use **[ ]** to display different columns.

2. Use a list to retrieve multiple columns.

In [None]:
# Retrieve several columns from the elections DataFrame and convert each one to a plain Python list.

# Selecting multiple columns
columns = ['Candidate', 'Party']

# Retrieving columns as lists
result = {col: df1[col].tolist() for col in columns}

# Displaying the result
for col, values in result.items():
    print(f"{col}: {values}")

The **[ ]** operator also accepts a list of strings. In this case, you get back a **DataFrame** corresponding to the requested strings.

In [None]:
# Passing a list of column names returns a DataFrame with just those columns.
df1[['Candidate', 'Party']]

A list of one label also returns a DataFrame. This can be handy if you want your results as a DataFrame, not a series.

Note that we can also use the **to_frame** method to turn a Series into a DataFrame.

Extract the single column named "Candidate" from the DataFrame; the result is a Series. Then convert that Series into a DataFrame.

In [3]:
# Extract the 'Candidate' column as a Series
candidate_series = df1['Candidate']

# Convert the Series into a DataFrame
candidate_series.to_frame()

Unnamed: 0,Candidate
0,Andrew Jackson
1,John Quincy Adams
2,Andrew Jackson
3,John Quincy Adams
4,Andrew Jackson
...,...
177,Jill Stein
178,Joseph Biden
179,Donald Trump
180,Jo Jorgensen
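
The three ways of getting a single column described above can be compared side by side. A minimal sketch on an illustrative three-row frame (the name `sample` is not from the notebook):

```python
import pandas as pd

sample = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic"],
})

as_series = sample["Candidate"]    # string label -> Series
as_frame = sample[["Candidate"]]   # list with one label -> DataFrame
converted = as_series.to_frame()   # Series -> DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(converted.equals(as_frame))
```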


### Row Indexing

The `[]` operator also accepts numerical slices as arguments. In this case, we are indexing by row, not column!

Extract a few rows from the DataFrame.

In [5]:
subset1 = df3[1:3]

print(subset1)

     State                Motto   Translation Language Date Adopted
1   Alaska  North to the future             —  English         1967
2  Arizona           Ditat Deus  God enriches    Latin         1863


If you provide a single argument to the `[]` operator, it tries to use it as a name. This is true even if the argument passed to **[ ]** is an integer.

In [None]:
# df1[0]  # This fails with a KeyError because 0 is not a column name; uncomment it to see the failure in action.
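
The difference between a single integer and a numerical slice can be sketched as follows, again on an illustrative frame named `sample` rather than the real elections data.

```python
import pandas as pd

sample = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
    "Year": [1824, 1824, 1828],
})

raised = False
try:
    sample[0]            # treated as a column name; there is no column called 0
except KeyError:
    raised = True
print("sample[0] raised KeyError:", raised)

first_row = sample[0:1]  # a slice is treated as a row selection
print(first_row)
```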

The following cells allow you to **test your understanding**. Let's go over the summary of what we have learnt (see slides).

# Creating DataFrames
Create a DataFrame from a list of rows and a list of column names.

In [None]:
pd.DataFrame([[1, "one"], [2, "two"]],
columns = ["Number", "Description"])

Creating DataFrames from a list of **dictionaries** (one dictionary per row).

In [None]:
pd.DataFrame([{"Fruit":"Strawberry", "Price":5.49},
{"Fruit":"Orange", "Price":3.99}])
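
A third form, not shown in the cells above, is a dictionary that maps each column name to a list of values. The sketch below assumes the same fruit data as the previous cell and checks that both constructions give the same table.

```python
import pandas as pd

# One dictionary per row (as in the cell above).
from_rows = pd.DataFrame([{"Fruit": "Strawberry", "Price": 5.49},
                          {"Fruit": "Orange", "Price": 3.99}])

# One dictionary for the whole table: column name -> list of values.
from_columns = pd.DataFrame({"Fruit": ["Strawberry", "Orange"],
                             "Price": [5.49, 3.99]})

print(from_columns)
print(from_rows.equals(from_columns))
```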

## Filtering via Boolean Array Selection

The `[]` operator also supports an array of booleans as input. In this case, the array must be exactly as long as the number of rows. The result is a **filtered version of the data frame**, where **only rows corresponding to True appear**.

In [40]:
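# The mask needs one entry per row, so this 23-entry mask is applied to the first 23 rows only.
mask = [False, False, False, False, False,
        False, False, True, False, False,
        True, False, False, False, True,
        False, False, False, False, False,
        False, True, False]
df1.head(23)[mask]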

One very common task in Data Science is **filtering**. Boolean Array Selection is one way to achieve this in Pandas. We start by observing that **logical operators** like the equality operator can be applied to **Pandas Series data** to generate a **Boolean Array**.

Compare the 'Result' column to the string 'win' and show the results.

In [3]:
#Answer Here

bool_filter = df1['Result'] == 'win'

winning_candidates = df1[bool_filter]

print(winning_candidates)

     Year               Candidate                  Party  Popular vote Result  \
1    1824       John Quincy Adams  Democratic-Republican        113142    win   
2    1828          Andrew Jackson             Democratic        642806    win   
4    1832          Andrew Jackson             Democratic        702735    win   
8    1836        Martin Van Buren             Democratic        763291    win   
11   1840  William Henry Harrison                   Whig       1275583    win   
13   1844              James Polk             Democratic       1339570    win   
16   1848          Zachary Taylor                   Whig       1360235    win   
17   1852         Franklin Pierce             Democratic       1605943    win   
20   1856          James Buchanan             Democratic       1835140    win   
23   1860         Abraham Lincoln             Republican       1855993    win   
27   1864         Abraham Lincoln         National Union       2211317    win   
30   1868           Ulysses 

Compare the 'Party' column to the string 'Democratic' and show the results.

In [4]:
#Answer Here

bool_filter = df1['Party'] == 'Democratic'

democratic_candidates = df1[bool_filter]

print(democratic_candidates)

     Year               Candidate       Party  Popular vote Result          %
2    1828          Andrew Jackson  Democratic        642806    win  56.203927
4    1832          Andrew Jackson  Democratic        702735    win  54.574789
8    1836        Martin Van Buren  Democratic        763291    win  52.272472
10   1840        Martin Van Buren  Democratic       1128854   loss  46.948787
13   1844              James Polk  Democratic       1339570    win  50.749477
14   1848              Lewis Cass  Democratic       1223460   loss  42.552229
17   1852         Franklin Pierce  Democratic       1605943    win  51.013168
20   1856          James Buchanan  Democratic       1835140    win  45.306080
28   1864     George B. McClellan  Democratic       1812807   loss  45.048488
29   1868         Horatio Seymour  Democratic       2708744   loss  47.334695
34   1876        Samuel J. Tilden  Democratic       4288546   loss  51.528376
37   1880  Winfield Scott Hancock  Democratic       4444976   lo

The output of the logical operator applied to the Series is **another Series with the same name and index, but of datatype boolean**.

These boolean Series can be used as an argument to the `[]` operator.
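
A quick way to inspect such a boolean Series is sketched below. The frame `sample` holds five rows copied from the head of the elections output shown in this notebook; the variable names are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "Year": [1824, 1824, 1828, 1828, 1832],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson",
                  "John Quincy Adams", "Andrew Jackson"],
    "Result": ["loss", "win", "win", "loss", "win"],
})

is_win = sample["Result"] == "win"   # element-wise comparison
print(is_win.name, is_win.dtype)     # same name as the source column, boolean dtype
print(is_win.tolist())

winners = sample[is_win]             # keep only the rows where the mask is True
print(winners.index.tolist())
```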

Create a DataFrame of all election winners since 1980.

In [7]:
#Answer Here
bool_filter = (df1['Year'] >= 1980) & (df1['Result'] == 'win')

winners_since_1980 = df1[bool_filter]

print(winners_since_1980)

     Year          Candidate       Party  Popular vote Result          %
131  1980      Ronald Reagan  Republican      43903230    win  50.897944
133  1984      Ronald Reagan  Republican      54455472    win  59.023326
135  1988  George H. W. Bush  Republican      48886597    win  53.518845
140  1992       Bill Clinton  Democratic      44909806    win  43.118485
144  1996       Bill Clinton  Democratic      47400125    win  49.296938
152  2000     George W. Bush  Republican      50456002    win  47.974666
157  2004     George W. Bush  Republican      62040610    win  50.771824
162  2008       Barack Obama  Democratic      69498516    win  53.023510
168  2012       Barack Obama  Democratic      65915795    win  51.258484
173  2016       Donald Trump  Republican      62984828    win  46.407862
178  2020       Joseph Biden  Democratic      81268924    win  51.311515


Above, we've assigned the result of the logical operator to a new variable called `bool_filter`. This is uncommon. Usually, the Series is created and used on the same line. Such code is a little tricky to read at first, but you'll get used to it quickly.

Show all 'win' results from 1980 to 2000.

In [11]:
#Answer Here
winners_1980_2000 = df1[(df1['Year'] >= 1980) & (df1['Year'] <= 2000) & (df1['Result'] == 'win')]

print(winners_1980_2000)

     Year          Candidate       Party  Popular vote Result          %
131  1980      Ronald Reagan  Republican      43903230    win  50.897944
133  1984      Ronald Reagan  Republican      54455472    win  59.023326
135  1988  George H. W. Bush  Republican      48886597    win  53.518845
140  1992       Bill Clinton  Democratic      44909806    win  43.118485
144  1996       Bill Clinton  Democratic      47400125    win  49.296938
152  2000     George W. Bush  Republican      50456002    win  47.974666


Show all 'loss' results for the Independent party.

In [10]:
# Answer Here
loss_independent = df1[(df1['Result'] == 'loss') & (df1['Party'] == 'Independent')]

print(loss_independent)

     Year         Candidate        Party  Popular vote Result          %
121  1976   Eugene McCarthy  Independent        740460   loss   0.911649
130  1980  John B. Anderson  Independent       5719850   loss   6.631143
143  1992        Ross Perot  Independent      19743821   loss  18.956298
161  2004       Ralph Nader  Independent        465151   loss   0.380663
167  2008       Ralph Nader  Independent        739034   loss   0.563842
174  2016     Evan McMullin  Independent        732273   loss   0.539546


We can select multiple criteria by creating multiple boolean Series and combining them using the `&` operator.
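
Besides `&` (and), boolean Series can be combined with `|` (or) and negated with `~` (not). The sketch below uses an illustrative five-row frame; each comparison is wrapped in parentheses because `&` and `|` bind more tightly than `==` and `>=` in Python.

```python
import pandas as pd

sample = pd.DataFrame({
    "Year": [1824, 1824, 1828, 1828, 1832],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic",
              "National Republican", "Democratic"],
    "Result": ["loss", "win", "win", "loss", "win"],
})

both = sample[(sample["Result"] == "win") & (sample["Year"] >= 1828)]
either = sample[(sample["Party"] == "Democratic") | (sample["Result"] == "loss")]
not_win = sample[~(sample["Result"] == "win")]

print(both.index.tolist())
print(either.index.tolist())
print(not_win.index.tolist())
```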

Show winning results with a vote percentage below 50%.

In [13]:
# Answer Here
low_percentage = df1[(df1['Result'] == 'win') & (df1['%'] < 50)]

print(low_percentage)

     Year          Candidate                  Party  Popular vote Result  \
1    1824  John Quincy Adams  Democratic-Republican        113142    win   
16   1848     Zachary Taylor                   Whig       1360235    win   
20   1856     James Buchanan             Democratic       1835140    win   
23   1860    Abraham Lincoln             Republican       1855993    win   
33   1876   Rutherford Hayes             Republican       4034142    win   
36   1880     James Garfield             Republican       4453337    win   
39   1884   Grover Cleveland             Democratic       4914482    win   
43   1888  Benjamin Harrison             Republican       5443633    win   
47   1892   Grover Cleveland             Democratic       5553898    win   
70   1912     Woodrow Wilson             Democratic       6296284    win   
74   1916     Woodrow Wilson             Democratic       9126868    win   
100  1948       Harry Truman             Democratic      24179347    win   
117  1968   

## loc and iloc

Show the first 5 entries.

In [15]:
print(df1.head(5))

   Year          Candidate                  Party  Popular vote Result  \
0  1824     Andrew Jackson  Democratic-Republican        151271   loss   
1  1824  John Quincy Adams  Democratic-Republican        113142    win   
2  1828     Andrew Jackson             Democratic        642806    win   
3  1828  John Quincy Adams    National Republican        500897   loss   
4  1832     Andrew Jackson             Democratic        702735    win   

           %  
0  57.210122  
1  42.789878  
2  56.203927  
3  43.796073  
4  54.574789  


You can provide `.loc` with row labels (here the slice `0:5`) and a list of column labels (`['Candidate', 'Party', 'Year']`) as input to return a DataFrame.

In [16]:

df1_subset = df1.loc[0:5, ['Candidate', 'Party', 'Year']]

print(df1_subset)


           Candidate                  Party  Year
0     Andrew Jackson  Democratic-Republican  1824
1  John Quincy Adams  Democratic-Republican  1824
2     Andrew Jackson             Democratic  1828
3  John Quincy Adams    National Republican  1828
4     Andrew Jackson             Democratic  1832
5         Henry Clay    National Republican  1832


`loc` also supports **slicing** (for all types, including numeric and string labels!). Note that the slicing for `loc` is **inclusive**, even for numeric slices.

Use slicing on rows and columns.

In [None]:

df1_subset = df1.loc[0:5, 'Candidate':'Party']

print(df1_subset)
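
The inclusive behaviour is easy to miss, so here is a small sketch that compares a `loc` slice with an ordinary Python list slice. The frame `sample` is illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "Year": [1824, 1824, 1828, 1828, 1832],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson",
                  "John Quincy Adams", "Andrew Jackson"],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic",
              "National Republican", "Democratic"],
    "Result": ["loss", "win", "win", "loss", "win"],
})

loc_slice = sample.loc[0:2, "Candidate":"Party"]  # both end points are included
list_slice = list(range(5))[0:2]                  # the end point is excluded

print(loc_slice.shape)
print(loc_slice.columns.tolist())
print(len(list_slice))
```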


If we provide only a **single label** for the column argument, we get back a **Series**.

In [None]:

candidates_series = df1.loc[0:5, 'Candidate']

print(candidates_series)


If we want a data frame instead and don't want to use to_frame, we can provide a **list** containing the column name.

In [None]:
candidates_df = df1.loc[0:5, ['Candidate']]

print(candidates_df)

If we give only one row but many column labels, we'll get back a **Series** corresponding to a row of the table. This new Series has a neat index, where **each entry is the name of the column** that the data came from.

In [None]:

row_series = df1.loc[0, ['Candidate', 'Party', 'Year']]

print(row_series)
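
The structure of such a row Series can be inspected as in the sketch below (illustrative three-row frame; names are not from the notebook).

```python
import pandas as pd

sample = pd.DataFrame({
    "Year": [1824, 1824, 1828],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic"],
})

row = sample.loc[0, ["Candidate", "Party", "Year"]]

print(row.index.tolist())  # the requested column names become the index
print(row.name)            # the row label becomes the Series name
print(row["Candidate"])    # values are looked up by column name
```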


If we omit the column argument altogether, the **default behavior is to retrieve all columns**.

In [None]:
row_all_columns = df1.loc[0]

print(row_all_columns)

Specify rows and columns as lists to retrieve specific entries.

In [None]:
selected_entries = df1.loc[[0, 1, 2], ['Candidate', 'Party', 'Year']]

print(selected_entries)

Boolean Series are also boolean arrays, so we can use the Boolean Array Selection from earlier using loc as well.

In [None]:
is_democratic = df1['Party'] == 'Democratic'

democratic_candidates = df1.loc[is_democratic]
print(democratic_candidates)

## String-labeled Rows

Let's do a quick example using data with string-labeled rows instead of integer labeled rows, just to make sure we're really understanding loc.

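Use the mottos.csv file.

In [38]:

df3 = pd.read_csv('/content/drive/My Drive/Data for Pandas/mottos.csv')

print(df3)
# df3 keeps the default integer index, so 'USA' is not a row label and this lookup raises a KeyError.
df3.loc['USA']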


             State                                              Motto  \
0          Alabama                      Audemus jura nostra defendere   
1           Alaska                                North to the future   
2          Arizona                                         Ditat Deus   
3         Arkansas                                     Regnat populus   
4       California                                    Eureka (Εὕρηκα)   
5         Colorado                                    Nil sine numine   
6      Connecticut                            Qui transtulit sustinet   
7         Delaware                           Liberty and Independence   
8          Florida                                    In God We Trust   
9          Georgia                        Wisdom, Justice, Moderation   
10          Hawaii                  Ua mau ke ea o ka ʻāina i ka pono   
11           Idaho                                      Esto perpetua   
12        Illinois                  State sovereign

KeyError: 'USA'

A slice of rows can be specified using slice notation even if the rows have string labels instead of integer labels. This requires the string column (here `State`) to be set as the index first.
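
As a sketch of string-labeled access, the example below builds a four-row frame from values visible in the mottos output above and uses `State` as the index. The name `mottos_sample` is illustrative.

```python
import pandas as pd

# Four rows copied from the mottos output shown in this notebook.
mottos_sample = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona", "Arkansas"],
    "Motto": ["Audemus jura nostra defendere", "North to the future",
              "Ditat Deus", "Regnat populus"],
    "Language": ["Latin", "English", "Latin", "Latin"],
}).set_index("State")

one_value = mottos_sample.loc["Alaska", "Motto"]
subset = mottos_sample.loc["Alaska":"Arizona", ["Motto", "Language"]]  # inclusive slice

missing = False
try:
    mottos_sample.loc["USA"]   # not a row label, so this raises KeyError
except KeyError:
    missing = True

print(one_value)
print(subset)
print("KeyError for 'USA':", missing)
```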

### iloc

`loc`'s cousin `iloc` is very similar, but it accesses data by numerical position instead of by label. For example, to access the top 3 rows and first 3 columns of a table, we can use `.iloc[0:3, 0:3]`. `iloc` slicing is **exclusive**, just like standard Python slicing of numerical values.

Use iloc to extract the first 3 rows and columns from the elections DataFrame.

In [None]:
first_three_rows_columns = df1.iloc[0:3, 0:3]
print(first_three_rows_columns)
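
The exclusive `iloc` slice and the inclusive `loc` slice can be compared on the same frame. A minimal sketch with an illustrative `sample`:

```python
import pandas as pd

sample = pd.DataFrame({
    "Year": [1824, 1824, 1828, 1828, 1832],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson",
                  "John Quincy Adams", "Andrew Jackson"],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic",
              "National Republican", "Democratic"],
    "Result": ["loss", "win", "win", "loss", "win"],
})

by_position = sample.iloc[0:3, 0:3]        # positions 0, 1, 2 (3 is excluded)
by_label = sample.loc[0:3, "Year":"Party"] # labels 0, 1, 2, 3 (3 is included)

print(by_position.shape)
print(by_label.shape)
```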


We will use both `loc` and `iloc` in the course. `loc` is generally preferred for a number of reasons, for example:

1. It is harder to make mistakes since you have to literally write out what you want to get.
2. Code is easier to read, because the reader doesn't have to know e.g., what column #17 represents.
3. It is robust against permutations of the data, e.g. the social security administration switches the order of two columns.

However, iloc is sometimes more convenient. We'll provide examples of when iloc is the superior choice.
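
The third reason above (robustness to reordered columns) can be illustrated with a short sketch. The frames below are illustrative; `reordered` simulates a data provider swapping two columns.

```python
import pandas as pd

sample = pd.DataFrame({
    "Year": [1824, 1824, 1828],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic"],
})
reordered = sample[["Candidate", "Year", "Party"]]  # same data, two columns swapped

# Label-based access returns the same column before and after the swap.
same_by_label = sample.loc[:, "Candidate"].equals(reordered.loc[:, "Candidate"])

# Position-based access silently returns a different column after the swap.
before_name = sample.iloc[:, 1].name
after_name = reordered.iloc[:, 1].name

print(same_by_label)
print(before_name, after_name)
```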

## Handy Properties and Utility Functions for Series and DataFrames

The `head` and `describe` methods and the `shape` and `size` attributes can be used to quickly get a good sense of the data we're working with. For example:

In [None]:
mottos = pd.read_csv("/content/drive/MyDrive/mottos.csv")

In [None]:
# Answer Here



Size of DataFrame

In [18]:
# Answer Here
df3.size

250

The fact that the size is 250 means our data file is relatively small, with only 250 total entries.

Shape of DataFrame

In [19]:
# Answer Here
df3.shape


(50, 5)
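
The two numbers are related: `size` is the number of rows times the number of columns, which is why a (50, 5) frame has size 250. A small sketch on an illustrative frame:

```python
import pandas as pd

small = pd.DataFrame({
    "State": ["Alaska", "Arizona"],
    "Motto": ["North to the future", "Ditat Deus"],
    "Language": ["English", "Latin"],
})

n_rows, n_cols = small.shape
print(small.shape, small.size, len(small))
print(small.size == n_rows * n_cols)
```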

Use the describe function to extract meaningful information from the DataFrame.

In [20]:
# Answer Here
df3.describe()  # note the parentheses: describe is a method and must be called

Above, we see a quick summary of all the data. For example, the most common language for mottos is Latin, which covers 23 different states. Does anything else seem surprising?

We can get a direct reference to the index using .index.

In [21]:
# Answer Here
df3.index

RangeIndex(start=0, stop=50, step=1)

In [None]:
mottos.head(2)

It turns out the columns also have an Index. We can access this index by using `.columns`.

In [22]:
# Answer Here
df3.columns

Index(['State', 'Motto', 'Translation', 'Language', 'Date Adopted'], dtype='object')

## Sorting and Value Counts

There are also a ton of useful utility methods we can use with Data Frames and Series. For example, we can create a copy of a data frame sorted by a specific column using `sort_values`.

In [24]:
# Answer Here
df3.sort_values(by='Date Adopted')



Unnamed: 0,State,Motto,Translation,Language,Date Adopted
20,Massachusetts,Ense petit placidam sub libertate quietem,"By the sword we seek peace, but peace only und...",Latin,1775
45,Virginia,Sic semper tyrannis,Thus always to tyrants,Latin,1776
31,New York,Excelsior,Ever upward,Latin,1778
9,Georgia,"Wisdom, Justice, Moderation",—,English,1798
12,Illinois,"State sovereignty, national union",—,English,1819
18,Maine,Dirigo,I lead,Latin,1820
14,Iowa,Our liberties we prize and our rights we will ...,—,English,1847
7,Delaware,Liberty and Independence,—,English,1847
4,California,Eureka (Εὕρηκα),I have found it,Greek,1849
48,Wisconsin,Forward,—,English,1851


As mentioned before, Data Frame methods such as `sort_values` return a copy and do **not** modify the original data structure, unless you set inplace to True.

If we want to sort in reverse order, we can set `ascending=False`.
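
A short sketch of both points, using popular-vote figures copied from the elections output earlier in this notebook (the frame name `sample` is illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson",
                  "John Quincy Adams", "Andrew Jackson"],
    "Popular vote": [151271, 113142, 642806, 500897, 702735],
})

descending = sample.sort_values(by="Popular vote", ascending=False)

print(descending.index.tolist())  # row labels travel with their rows
print(sample.index.tolist())      # the original order is unchanged
```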

In [27]:
#elections.sort_values('%', ascending=False)
sorted_df3_descending = df3.sort_values(by='Date Adopted', ascending=False)

print(sorted_df3_descending)


             State                                              Motto  \
46      Washington                                      Al-ki or Alki   
47   West Virginia                              Montani semper liberi   
6      Connecticut                            Qui transtulit sustinet   
34            Ohio                  With God, all things are possible   
5         Colorado                                    Nil sine numine   
38    Rhode Island                                               Hope   
43            Utah                                           Industry   
41       Tennessee                           Agriculture and Commerce   
39  South Carolina          Dum spiro spero \nAnimis opibusque parati   
29      New Jersey                             Liberty and prosperity   
33    North Dakota                    \nSerit ut alteri saeclo prosit   
35        Oklahoma                                 Labor omnia vincit   
21        Michigan          Si quaeris peninsulam a

We can also use `sort_values` on Series objects.

In [29]:
sorted_languages = df3['Language'].sort_values()

print(sorted_languages.head(50))


46    Chinook Jargon
49           English
29           English
28           English
27           English
26           English
48           English
37           English
38           English
40           English
17           English
34           English
42           English
14           English
41           English
12           English
1            English
13           English
8            English
7            English
9            English
43           English
22            French
4              Greek
10          Hawaiian
19           Italian
39             Latin
44             Latin
36             Latin
45             Latin
47             Latin
35             Latin
33             Latin
0              Latin
31             Latin
30             Latin
23             Latin
21             Latin
20             Latin
18             Latin
16             Latin
15             Latin
11             Latin
6              Latin
5              Latin
3              Latin
2              Latin
32           

For Series, the `value_counts` method is often quite handy.

In [30]:
df3['Language'].value_counts()

Unnamed: 0_level_0,count
Language,Unnamed: 1_level_1
Latin,23
English,21
Greek,1
Hawaiian,1
Italian,1
French,1
Spanish,1
Chinook Jargon,1


Also commonly used is the `unique` method, which returns **all unique values** as a numpy array.

In [35]:
unique_languages = df3['Language'].unique()

print(unique_languages)

['Latin' 'English' 'Greek' 'Hawaiian' 'Italian' 'French' 'Spanish'
 'Chinook Jargon']
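
`value_counts` and `unique` are closely related: the first counts each distinct value, the second lists the distinct values in order of first appearance. A minimal sketch on an illustrative four-entry Series:

```python
import pandas as pd

languages = pd.Series(["Latin", "English", "Latin", "Latin"], name="Language")

counts = languages.value_counts()       # most frequent value first
distinct = list(languages.unique())     # order of first appearance

print(counts)
print(distinct)
print(languages.nunique())              # number of distinct values
```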


In [31]:
def fiba(n):
    if n < 2:
        return n
    else:
        return fiba(n-1) + fiba(n-2)



fiba(5)

5

# Thank you!