# I. Basic Data Structures
The basic data structures in Python include <b>tuple, list, string, and dictionary</b>. This section summarizes how to manipulate them. Techniques for the more complex data structures, <b>series, dataframe, and array</b>, are introduced in the subsequent sections.<br/><br/>
The data type of a variable can be identified with the <font color='red'><b>type</b></font> function, i.e. <font color='red'><b>type(variable)</b></font>. Checking the data type of variables is often helpful in debugging.

In [26]:
type('This is a string')

str

In [4]:
type([1, 2, 3])

list

## 1.1 Tuple (not frequently used)
1. Tuples are an immutable data structure. The elements in a tuple cannot be altered once it is declared. No element can be added to an existing tuple.<br/>
2. Tuples are declared by including the elements within <font color='red'><b>( )</b></font>.<br/>
3. Different types of data can be put in a tuple.<br/>
4. Elements in a tuple can be accessed by index, i.e. <font color='red'><b>tuple[index]</b></font>. Index starts from 0.

E.g.1 Declare a tuple and access its 2nd element.

In [12]:
x = (1, 'a', 2, 'b')
x[1]

'a'

E.g.2 Unpack a tuple/list into different variables. Make sure the number of values to be unpacked matches the number of variables being assigned.

In [60]:
x = ('Christopher', 'Brooks', 'brooksch@umich.edu')
fname, lname, email = x
lname

'Brooks'
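
The immutability described in point 1 can be checked directly. The following is a minimal sketch; the variable names `point`, `error_message`, and `extended` are illustrative and not part of the original notebook.

```python
point = (1, 'a', 2, 'b')

try:
    point[0] = 99  # item assignment is not supported on tuples
except TypeError as err:
    error_message = str(err)

# "Changing" a tuple really means building a new one, e.g. by concatenation.
extended = point + (3.3,)

print(error_message)
print(extended)
```

Note the trailing comma in `(3.3,)`: without it, the parentheses would just wrap a float rather than create a one-element tuple.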

## 1.2 List
1. Lists are a mutable data structure. They play a role similar to vectors in R, although a single list can hold values of different types.
2. Lists are declared by including the elements within <font color='red'><b>[ ]</b></font>.<br/>
3. Different types of data can be put in a list.<br/>
4. Elements in a list can be accessed by index, i.e. <font color='red'><b>list[index]</b></font>. Index starts from 0.<br/>
5. A new element can be added to a list with the <b>append</b> method, i.e. <font color='red'><b>list.append(new_element)</b></font>.<br/>
6. Lists can be concatenated with the <b>"+"</b> operator and repeated with the <b>"`*`"</b> operator, i.e. <font color='red'><b>list1 + list2, list `*` n</b></font>.<br/>
7. The list elements can be viewed by <font color='red'><b>print(list)</b></font>.<br/>
8. The number of elements in a list can be obtained with the <font color='red'><b>len(list)</b></font> function.<br/>
9. The existence of an element in a list can be checked with the <b>in</b> operator, i.e. <font color='red'><b>element in list</b></font>.

E.g.1 Declare a list, append element to list, and print the elements in list.

In [13]:
x = [1, 'a', 2, 'b']
y = [] # Declare an empty list
x.append(3.3)
print(x)

[1, 'a', 2, 'b', 3.3]


E.g.2 Loop through and print each item in the list.

In [16]:
for item in x:
    print(item)

1
a
2
b
3.3


E.g.3 Loop through each item using the indexing operator.

In [18]:
i = 0
while i < len(x):
    print(x[i])
    i = i + 1

1
a
2
b
3.3
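
When both the index and the item are needed, the built-in `enumerate` avoids the manual counter used in E.g.3. This is a small sketch of that alternative; the `pairs` list is only there to make the result easy to inspect.

```python
x = [1, 'a', 2, 'b', 3.3]

# enumerate yields (index, item) pairs, so no manual counter is needed.
pairs = []
for i, item in enumerate(x):
    pairs.append((i, item))
    print(i, item)
```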


E.g.4 Concatenate lists with "+" operator.

In [20]:
[1, 2] + [3, 4]

[1, 2, 3, 4]

E.g.5 Repeat lists with the "`*`" operator.

In [22]:
[1] * 3

[1, 1, 1]

E.g.6 Check existence of element in a list. A boolean result is returned.

In [24]:
1 in [1, 2, 3]

True

## 1.3 String (list with character elements)
A string behaves like a sequence of characters. <b>The read-only list operations in Section 1.2 (indexing, looping, len, in, "+" and "`*`") also work on strings, but strings are immutable, so append and item assignment do not. The slicing operations shown below also apply to ordinary lists.</b> Some string-specific techniques include<br/>
1. Split a string on a specific separator into a list of substrings with the <b>split</b> method, i.e. <font color="red"><b>string.split("character")</b></font><br/>
2. Convert other data types to strings with <font color="red"><b>str(variable)</b></font>.

E.g.1 Slice string with indexing operator, i.e. <font color="red"><b>string[starting index:stopping index]</b></font>

In [30]:
x = 'This is a string'
print(x[0]) # first character
print(x[0:1]) # first character, but we have explicitly set the end character
print(x[0:2]) # first two characters
print(x[-1]) # last character of the string
print(x[-4:-2]) # slice starting from the 4th element from the end and stopping before the 2nd element from the end
print(x[:3]) # slice from the beginning of the string, stopping before index 3 (the first three characters)
print(x[3:]) # slice starting at index 3 (the 4th character) and going all the way to the end

T
T
Th
g
ri
Thi
s is a string


E.g.2 Concatenate strings with the "+" operator, repeat them with the "`*`" operator, and check the existence of a substring with the "in" operator.

In [35]:
firstname = 'Christopher'
lastname = 'Brooks'

print(firstname + ' ' + lastname)
print(firstname*3)
print('Chris' in firstname)

Christopher Brooks
ChristopherChristopherChristopher
True


E.g.3 Split string into a list of substrings on space or a specific letter.

In [42]:
namelist = 'Christopher Arthur Hansen Brooks'.split()
firstname = 'Christopher Arthur Hansen Brooks'.split('e')[0] # [0] selects the first element of the list
lastname = 'Christopher Arthur Hansen Brooks'.split('e')[-1] # [-1] selects the last element of the list
print(namelist)
print(firstname)
print(lastname)

['Christopher', 'Arthur', 'Hansen', 'Brooks']
Christoph
n Brooks


E.g.4 Convert objects to strings before concatenating.

In [43]:
'Chris' + str(2)

'Chris2'
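
Two points are worth a quick check: strings cannot be modified in place, and `str.join` is the counterpart of `split`. The sketch below illustrates both; the variable names are illustrative only.

```python
x = 'This is a string'

# Strings are immutable: item assignment raises TypeError.
try:
    x[0] = 't'
except TypeError:
    assignment_failed = True

# str.join glues a list of strings back together with the given separator.
words = x.split()            # split on whitespace
rejoined = ' '.join(words)   # same text as the original string
dashed = '-'.join(words)     # same words, different separator

print(assignment_failed)
print(words)
print(dashed)
```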

## 1.4 Dictionary
1. Dictionaries associate keys with values. They are declared by including the pairs of key and value within { }, i.e. <font color="red"><b>dict = {key1: value1, key2: value2}</b></font>.<br/>
2. A dictionary value can be accessed with its key, i.e. <font color="red"><b>dict[key1]</b></font>.<br/>
3. A dictionary can be extended by assigning additional pairs of key and value, i.e. <font color="red"><b>dict[key3] = value3</b></font>.<br/>
4. Dictionary keys, values, and key-value pairs can be accessed with <font color="red"><b>dict.keys(), dict.values(), dict.items()</b></font> respectively.

E.g.1 Declare a dictionary for names and emails, and access the email address for a given name.

In [44]:
x = {'Christopher Brooks': 'brooksch@umich.edu', 'Bill Gates': 'billg@microsoft.com'}
x['Christopher Brooks']

'brooksch@umich.edu'

E.g.2 Extend the dictionary with an additional pair of key and value.

In [56]:
x['Wang Shenghao'] = 'wangshenghao1993@gmail.com'
x

{'Bill Gates': 'billg@microsoft.com',
 'Christopher Brooks': 'brooksch@umich.edu',
 'Wang Shenghao': 'wangshenghao1993@gmail.com'}

E.g.3 Iterate over all of the names (keys) and print each email value.

In [50]:
for name in x:
    print(x[name])

brooksch@umich.edu
billg@microsoft.com
wangshenghao1993@gmail.com


E.g.4 Iterate over all of the email values.

In [52]:
for email in x.values():
    print(email)

brooksch@umich.edu
billg@microsoft.com
wangshenghao1993@gmail.com


E.g.5 Iterate over all of the items (pairs of key and value)

In [57]:
for name, email in x.items():
    print(name)
    print(email)

Christopher Brooks
brooksch@umich.edu
Bill Gates
billg@microsoft.com
Wang Shenghao
wangshenghao1993@gmail.com
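
Accessing a key that does not exist with `dict[key]` raises a `KeyError`. A minimal sketch of two common ways to guard against that follows; the name 'Ada Lovelace' is a hypothetical key that is deliberately absent from the dictionary.

```python
x = {'Christopher Brooks': 'brooksch@umich.edu', 'Bill Gates': 'billg@microsoft.com'}

has_bill = 'Bill Gates' in x             # `in` tests the keys, not the values
missing = x.get('Ada Lovelace')          # returns None instead of raising KeyError
fallback = x.get('Ada Lovelace', 'n/a')  # returns the explicit default instead

print(has_bill, missing, fallback)
```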


# II. Reading and Writing Data Files

Data files in different formats can be imported and processed in Python. Common formats include .csv, .xls, .json, and .txt. The pandas library is frequently used to load external data files. After the data is cleaned, it can also be exported in different formats for future use.

## 2.1 Import csv file as dataframe
In Python, CSV data can be imported as a dataframe or as a list of dictionary objects. Most commonly, csv files are read as dataframes with the <b>pd.read_csv()</b> function, i.e. <font color="red"><b>df = pd.read_csv('data.csv', skiprows = n or a list of row numbers)</b></font>. The <b>read_csv()</b> function accepts many parameters besides the file name. The <b>skiprows</b> parameter is very useful when the CSV file contains redundant rows.<br/>
<font color="red"><b>If skiprows = n, the first n lines of the file are ignored when the data is loaded.<br/>
If skiprows = [row numbers], the lines with those 0-indexed numbers are ignored.</b></font>

E.g. Load "world_bank.csv" as a dataframe and skip the first four rows.

In [3]:
import pandas as pd

df = pd.read_csv('world_bank.csv', skiprows=4)
df.head(3)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
0,Aruba,ABW,GDP at market prices (constant 2010 US$),NY.GDP.MKTP.KD,,,,,,,...,,,,,2467704000.0,,,,,
1,Andorra,AND,GDP at market prices (constant 2010 US$),NY.GDP.MKTP.KD,,,,,,,...,4018196000.0,4021331000.0,3675728000.0,3535389000.0,3346317000.0,3185605000.0,3129538000.0,3127550000.0,,
2,Afghanistan,AFG,GDP at market prices (constant 2010 US$),NY.GDP.MKTP.KD,,,,,,,...,10305230000.0,11721190000.0,12144480000.0,14697330000.0,15936800000.0,16911130000.0,19352200000.0,19731340000.0,19990320000.0,20294150000.0
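
The difference between an integer and a list for `skiprows` is easiest to see on a tiny input. The sketch below uses a made-up in-memory CSV (the `raw` string is hypothetical, not taken from "world_bank.csv") wrapped in `io.StringIO` so that no file is needed.

```python
import io
import pandas as pd

# Hypothetical file contents: two junk lines, a header line, then three data rows.
raw = "junk line 1\njunk line 2\nname,value\na,1\nb,2\nc,3\n"

# skiprows=2 drops the first two lines, so the third line becomes the header.
df_int = pd.read_csv(io.StringIO(raw), skiprows=2)

# skiprows=[0, 1, 4] drops those 0-indexed lines: both junk lines and the "b,2" row.
df_list = pd.read_csv(io.StringIO(raw), skiprows=[0, 1, 4])

print(df_int)
print(df_list)
```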


## 2.2 Import csv file as a list of dictionary objects
A CSV file can also be imported as a list of dictionaries with the <b>DictReader</b> class in the csv library. It is used in the following manner.
<font color="red"><b>with open('data.csv') as csvfile: dictlist = list(csv.DictReader(csvfile))</b></font>

E.g This example comes from the course material. Load the data from "mpg.csv" as a list of dictionaries, and display the first three dictionary objects.

In [4]:
import csv

%precision 2

with open('mpg.csv') as csvfile:
    mpg = list(csv.DictReader(csvfile))
    
mpg[:3] # The first three dictionaries in our list.

[OrderedDict([('', '1'),
              ('manufacturer', 'audi'),
              ('model', 'a4'),
              ('displ', '1.8'),
              ('year', '1999'),
              ('cyl', '4'),
              ('trans', 'auto(l5)'),
              ('drv', 'f'),
              ('cty', '18'),
              ('hwy', '29'),
              ('fl', 'p'),
              ('class', 'compact')]),
 OrderedDict([('', '2'),
              ('manufacturer', 'audi'),
              ('model', 'a4'),
              ('displ', '1.8'),
              ('year', '1999'),
              ('cyl', '4'),
              ('trans', 'manual(m5)'),
              ('drv', 'f'),
              ('cty', '21'),
              ('hwy', '29'),
              ('fl', 'p'),
              ('class', 'compact')]),
 OrderedDict([('', '3'),
              ('manufacturer', 'audi'),
              ('model', 'a4'),
              ('displ', '2'),
              ('year', '2008'),
              ('cyl', '4'),
              ('trans', 'manual(m6)'),
              ('drv',
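
As the output shows, `DictReader` returns every value as a string, so numeric columns have to be converted before any arithmetic. The following sketch assumes a small hypothetical excerpt with the same column names as "mpg.csv", read from memory rather than from the real file.

```python
import csv
import io

# Hypothetical three-row excerpt using some of the mpg.csv column names.
sample = io.StringIO(
    "manufacturer,model,cty,hwy\n"
    "audi,a4,18,29\n"
    "audi,a4,21,29\n"
    "audi,a4,20,31\n"
)
rows = list(csv.DictReader(sample))

# Every value arrives as a string, so convert to float before averaging.
avg_cty = sum(float(row['cty']) for row in rows) / len(rows)

print(type(rows[0]['cty']))
print(avg_cty)
```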

## 2.3 Import xls file as dataframe
Similar to csv files, xls files can be imported with the <b>read_excel()</b> function from the pandas library. The <b>skiprows</b> and <b>skipfooter</b> parameters can be used to ignore redundant rows of data. Note that skipfooter takes the <b>number of rows to be skipped at the bottom of the xls file</b>. Since an xls file may contain multiple spreadsheets, the sheet name needs to be specified when there is more than one.<br/>
<font color="red"><b>df = pd.read_excel('data.xls', sheet_name = 'xxx', skiprows = n1, skipfooter = n2)</b></font>

E.g.1 Load the "Energy" spreadsheet in "Energy Indicators.xls". Skip the first 17 rows at the top and last 38 rows at the bottom of the xls file. Do open the file and carefully observe the format.

In [9]:
df = pd.read_excel('Energy Indicators.xls', sheet_name='Energy', skiprows=17, skipfooter=38)
df.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Petajoules,Gigajoules,%
0,,Afghanistan,Afghanistan,321,10,78.66928
1,,Albania,Albania,102,35,100.0
2,,Algeria,Algeria,1959,51,0.55101
3,,American Samoa,American Samoa,...,...,0.641026
4,,Andorra,Andorra,9,121,88.69565


E.g.2 Load the only spreadsheet in "gdplev.xls", and skip the first 5 rows and the 7th and 8th rows. Hint: create a list called "skip" holding the 0-indexed numbers of the rows to be ignored.

In [10]:
skip = list(range(5)) + [6, 7]  # 0-indexed rows 0-4, 6 and 7
df = pd.read_excel("gdplev.xls", skiprows=skip)
df.head()

Unnamed: 0.1,Unnamed: 0,GDP in billions of current dollars,GDP in billions of chained 2009 dollars,Unnamed: 3,Unnamed: 4,GDP in billions of current dollars.1,GDP in billions of chained 2009 dollars.1,Unnamed: 7
0,1929.0,104.6,1056.6,,1947q1,243.1,1934.5,
1,1930.0,92.2,966.7,,1947q2,246.3,1932.3,
2,1931.0,77.4,904.8,,1947q3,250.1,1930.3,
3,1932.0,59.5,788.2,,1947q4,260.3,1960.7,
4,1933.0,57.2,778.3,,1948q1,266.2,1989.5,


## 2.4 Import json file
JSON files can be loaded as dataframes or as lists of objects and converted to other formats as needed.<br/>
Import json as dataframe: <font color="red"><b>df = pd.read_json('data.json')</b></font><br/>
Import json as list of objects: <font color="red"><b>with open('data.json') as f: jsonlist = json.load(f)</b></font><br/>

E.g.1 Read 'crop_top3.json' as a list of objects, and display the list.

In [11]:
import json

with open('crop_top3.json') as f:  # the with block closes the file automatically
    jsonlist = json.load(f)
jsonlist

[{'error': [],
  'method': 'image/recognize',
  'reqid': '709980431671578624',
  'result': [{'objects': [{'box': [400, 371, 819, 875],
      'tags': [{'score': 0.99, 'tag': 'top'}]},
     {'box': [306, 885, 880, 1499],
      'tags': [{'score': 0.77, 'tag': 'bottom'}]}],
    'tag_group': 'product_detection'}],
  'status': 'OK'},
 {'error': [],
  'method': 'image/recognize',
  'reqid': '709980434926358528',
  'result': [{'objects': [{'box': [123, 83, 251, 225],
      'tags': [{'score': 0.99, 'tag': 'top'}]},
     {'box': [116, 233, 270, 362], 'tags': [{'score': 0.68, 'tag': 'skirt'}]},
     {'box': [111, 108, 262, 326], 'tags': [{'score': 0.24, 'tag': 'other'}]}],
    'tag_group': 'product_detection'}],
  'status': 'OK'},
 {'error': [],
  'method': 'image/recognize',
  'reqid': '709980437405196288',
  'result': [{'objects': [{'box': [164, 54, 269, 179],
      'tags': [{'score': 0.99, 'tag': 'top'}]},
     {'box': [142, 168, 285, 369],
      'tags': [{'score': 0.78, 'tag': 'bottom'}]}],
 

E.g.2 Read 'crop_top3.json' as a dataframe, and display the dataframe.

In [13]:
df = pd.read_json('crop_top3.json')
df

Unnamed: 0,error,method,reqid,result,status
0,[],image/recognize,709980431671578624,"[{'objects': [{'box': [400, 371, 819, 875], 't...",OK
1,[],image/recognize,709980434926358528,"[{'objects': [{'box': [123, 83, 251, 225], 'ta...",OK
2,[],image/recognize,709980437405196288,"[{'objects': [{'box': [164, 54, 269, 179], 'ta...",OK
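
Once the JSON is loaded as a list of objects, the nested fields are reached with ordinary list and dictionary indexing. The sketch below assumes a single hypothetical record with the same shape as the entries in 'crop_top3.json', parsed from a string so that it runs without the file.

```python
import json

# Hypothetical single record in the same shape as the entries shown above.
text = '''
[{"error": [], "method": "image/recognize", "status": "OK",
  "result": [{"tag_group": "product_detection",
              "objects": [{"box": [400, 371, 819, 875],
                           "tags": [{"score": 0.99, "tag": "top"}]},
                          {"box": [306, 885, 880, 1499],
                           "tags": [{"score": 0.77, "tag": "bottom"}]}]}]}]
'''
records = json.loads(text)

# Walk record -> result -> objects -> tags and collect (tag, score) pairs.
found = []
for record in records:
    for result in record['result']:
        for obj in result['objects']:
            for tag in obj['tags']:
                found.append((tag['tag'], tag['score']))

print(found)
```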


## 2.5 Read txt files line by line as lists
Txt files can be read with the built-in <b>readlines()</b> method after the file is opened in Python. The text is returned as a list of lines, which can be further converted to a dataframe by <b>splitting the data in each line and inserting it into a dataframe</b>.<br/>
Open the .txt file in Python: <font color="red"><b>f = open('data.txt', 'r')</b></font>;<br/>
Read the data line by line into a list: <font color="red"><b>txtlist = f.readlines()</b></font>;<br/>
Close the .txt file: <font color="red"><b>f.close()</b></font>

E.g. Load 'university_towns.txt' as a list, and convert it to a dataframe with two columns 'State' and 'RegionName'.

In [18]:
f = open('university_towns.txt', 'r')
utown_list = f.readlines()
f.close()

In [20]:
utown_list

['Alabama[edit]\n',
 'Auburn (Auburn University)[1]\n',
 'Florence (University of North Alabama)\n',
 'Jacksonville (Jacksonville State University)[2]\n',
 'Livingston (University of West Alabama)[2]\n',
 'Montevallo (University of Montevallo)[2]\n',
 'Troy (Troy University)[2]\n',
 'Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]\n',
 'Tuskegee (Tuskegee University)[5]\n',
 'Alaska[edit]\n',
 'Fairbanks (University of Alaska Fairbanks)[2]\n',
 'Arizona[edit]\n',
 'Flagstaff (Northern Arizona University)[6]\n',
 'Tempe (Arizona State University)\n',
 'Tucson (University of Arizona)\n',
 'Arkansas[edit]\n',
 'Arkadelphia (Henderson State University, Ouachita Baptist University)[2]\n',
 'Conway (Central Baptist College, Hendrix College, University of Central Arkansas)[2]\n',
 'Fayetteville (University of Arkansas)[7]\n',
 'Jonesboro (Arkansas State University)[8]\n',
 'Magnolia (Southern Arkansas University)[2]\n',
 'Monticello (University of Arkansas at Montice

In [23]:
rows = []  # collect [state, region] pairs, then build the dataframe once
for item in utown_list:
    if '[edit]' in item:
        state = item.split('[')[0] # For "State", remove everything from "[" to the end.
    else:
        # For "RegionName", keep the text before "(" and strip surrounding whitespace,
        # which also drops the trailing newline when the line has no "(".
        region = item.split('(')[0].strip()
        rows.append([state, region])

# DataFrame.append was removed in pandas 2.0, so the rows are accumulated in a list instead.
uni_towns_df = pd.DataFrame(rows, columns=["State", "RegionName"])

uni_towns_df.head()

Unnamed: 0,State,RegionName
0,Alabama,Auburn
1,Alabama,Florence
2,Alabama,Jacksonville
3,Alabama,Livingston
4,Alabama,Montevallo


## 2.6 Write dataframe into csv file
Dataframes can be written into csv files with the <b>to_csv</b> function. Set <b>index</b> parameter to False to exclude the index column in the output csv file.<br/>
<font color="red"><b>df.to_csv('output.csv', index=False)</b></font>

E.g. Export the dataframe of the university towns into 'utown.csv' without index.

In [24]:
uni_towns_df.to_csv('utown.csv', index=False)
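
The effect of the <b>index</b> parameter can be inspected without writing a file, because `to_csv` returns the CSV text when no path is given. The sketch below uses a small hypothetical dataframe in place of the full `uni_towns_df`.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for uni_towns_df.
df = pd.DataFrame({'State': ['Alabama', 'Alaska'],
                   'RegionName': ['Auburn', 'Fairbanks']})

with_index = df.to_csv()                # no path given: returns the CSV text
without_index = df.to_csv(index=False)  # same, without the index column

print(with_index)
print(without_index)

# Reading the index-free text back gives the original two columns.
round_trip = pd.read_csv(io.StringIO(without_index))
```

With the default `index=True`, the header line starts with an empty field for the index column, which is why reloading such a file produces an extra "Unnamed: 0" column.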