## Workbook 1: Basics in Python

All resources, unless otherwise specified, are created by Nikhil Chinchalkar

Python is a great tool for both general software engineering and data science. The beauty of the language lies in its simplicity: unlike many other programming languages, its syntax is pretty straightforward and easy to learn. This is just a (kinda) short overview of some of the tools and techniques you might need in your data science journey. It won't cover all of Python (for that you should take a course like CS 1110), nor will it go as in depth into data science techniques as a course like INFO 2950.

Instead, to be 100% honest, my goal for these types of workbooks is for you to be able to properly Google the answers to certain questions you might have and error messages you encounter. That's what programming is at the end of the day.

### A: Libraries, Loops, Lists

Libraries are sets of functions that we can import into our project to help solve certain problems: how to read in a csv, how to merge two datasets, how to make a scatter plot, etc. These libraries come with documentation that's very useful. You should pretty much never read the documentation front to back - rather, reference it when you're getting errors or unsure of functions. 

We'll import two widely used libraries for now, using the keyword `as` to give each one a nickname so it's easier to reference later.

In [3]:
import numpy as np
import pandas as pd

If the above gave you an error, try uncommenting the lines below (remove the leading `#`), running the cell, and then re-running the import cell above once the installation has finished.

In [None]:
#%pip install numpy
#%pip install pandas

Collecting numpy
  Downloading numpy-2.3.3-cp313-cp313-macosx_14_0_arm64.whl.metadata (62 kB)
Downloading numpy-2.3.3-cp313-cp313-macosx_14_0_arm64.whl (5.1 MB)
Installing collected packages: numpy
Successfully installed numpy-2.3.3
Note: you may need to restart the kernel to use updated packages.
Collecting pandas
  Downloading pandas-2.3.3-cp313-cp313-macosx_11_0_arm64.whl.metadata (91 kB)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.3.3-cp313-cp313-macosx_11_0_arm64.whl (10.7 MB)
Downloading pytz-2025.2-py2.py3-none-any.whl

`NumPy` is used for a lot of basics like math functions and arrays. Arrays are basically lists of objects of the same type: strings (words), integers (whole numbers), or floats (decimals), to name a few. There are a ton of functions that we could talk about with arrays, but we'll stick to some basic ones for now. Note that there are also structures similar to arrays, like lists and tuples, but we're not going to talk about those right now, since they behave very similarly for our purposes.
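
As a quick sketch of what "same type" means in practice (the variable names here are made up for illustration and aren't used elsewhere in this workbook): a plain list can mix types, while an array settles on one type for every element.

```python
import numpy as np

# A plain Python list can mix types freely
mixed_list = [1, 2.5, "three"]

# A NumPy array stores one type (its dtype) for every element
int_array = np.array([1, 2, 3])
float_array = np.array([1, 2.5, 3])  # the integers get converted to floats

print(int_array.dtype)    # an integer type; the exact name depends on your platform
print(float_array.dtype)  # float64
print(float_array)        # every element is now a float
```

If you're ever unsure what an array holds, printing its `.dtype` is a quick way to check.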

I'm initializing below an array of the first 100 integers, starting from 0 (up to 99). This is what we'll be using to learn about some of the functions.

In [6]:
random_list = np.arange(100)
print(random_list)


[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99]


Print out the array to make sure you get the expected list.

In [6]:
#A1 your code here:

Expected output:

`[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99]`

First, we'll learn about indexing, which is just a fancy name for accessing certain elements of the array. Most things in Python are 0-indexed, which just means that the first item of the list is at position 0. This might seem counterintuitive now, but as you learn more about programming you'll learn that having things 0-indexed makes certain algorithms way easier.

In [15]:
#the [ ] allows us to access elements from an array
print(random_list[0])

#the : operator can be used to access elements across a certain range - note that 15 is exclusive, whereas 5 is inclusive
print(random_list[5:15])

#we can also not use a number at either end of our : to indicate we want python to go all the way to the end of that direction
print(random_list[95:])

print(random_list[:5])

0
[ 5  6  7  8  9 10 11 12 13 14]
[95 96 97 98 99]
[0 1 2 3 4]
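
Here's a small sketch of the same ideas on a made-up five-element array (`small` is just an illustrative name), plus one extra trick that isn't used above: negative positions, which count back from the end of the array.

```python
import numpy as np

small = np.array([10, 20, 30, 40, 50])  # hypothetical example array

print(small[0])    # 10, the first element sits at position 0
print(small[4])    # 50, the fifth (last) element sits at position 4
print(small[1:3])  # [20 30], position 1 is included and position 3 is not
print(small[-1])   # 50, negative positions count back from the end
print(small[-2:])  # [40 50], the last two elements
print(len(small))  # 5, so valid positions run from 0 to len(small) - 1
```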


It's your turn now! Print out the following using the same type of syntax (i.e. don't use 10 print statements to print each number):

In [21]:
#A2 your code here:
#print the element at index 90 of the array
print(random_list[90])
#print the elements from index 10 to index 20
print(random_list[10:21])
#print the last 10 elements of the array
print(random_list[len(random_list)-10:])

90
[10 11 12 13 14 15 16 17 18 19 20]
[90 91 92 93 94 95 96 97 98 99]


Expected output:

```
90
[10 11 12 13 14 15 16 17 18 19 20]
[90 91 92 93 94 95 96 97 98 99]
```

We can also do math on these lists: let's say I wanted the list to start at 1 instead of 0. I can just add 1 to each element of the list to achieve that effect:

In [9]:
random_list = random_list + 1

print(random_list)

[  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  91  92  93  94  95  96  97  98  99 100]


A couple of things about that previous call. First, if you run it a bunch of times, you'll add 1 each time, so you might end up adding more than 1 to your array. Second, the first line is obviously not mathematically sound as an equation, but it is OK to do in programming. All it means is that we take `random_list`, add 1 to each element, then assign the result back to `random_list`, effectively overwriting its previous version.
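
To see the "running it twice" effect without touching `random_list`, here's a minimal sketch on a throwaway array (`values` is a made-up name):

```python
import numpy as np

values = np.arange(5)  # [0 1 2 3 4]

values = values + 1    # read the old array, add 1, store the result back
print(values)          # [1 2 3 4 5]

values = values + 1    # running the same line again adds 1 a second time
print(values)          # [2 3 4 5 6]
```

If you ever lose track of how many times you've run a cell like this, re-running the cell that first created the array resets it.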

We can do the same for a natural log, using the `np.log` command:

In [10]:
print(np.log(random_list))

[0.         0.69314718 1.09861229 1.38629436 1.60943791 1.79175947
 1.94591015 2.07944154 2.19722458 2.30258509 2.39789527 2.48490665
 2.56494936 2.63905733 2.7080502  2.77258872 2.83321334 2.89037176
 2.94443898 2.99573227 3.04452244 3.09104245 3.13549422 3.17805383
 3.21887582 3.25809654 3.29583687 3.33220451 3.36729583 3.40119738
 3.4339872  3.4657359  3.49650756 3.52636052 3.55534806 3.58351894
 3.61091791 3.63758616 3.66356165 3.68887945 3.71357207 3.73766962
 3.76120012 3.78418963 3.80666249 3.8286414  3.8501476  3.87120101
 3.8918203  3.91202301 3.93182563 3.95124372 3.97029191 3.98898405
 4.00733319 4.02535169 4.04305127 4.06044301 4.07753744 4.09434456
 4.11087386 4.12713439 4.14313473 4.15888308 4.17438727 4.18965474
 4.20469262 4.21950771 4.2341065  4.24849524 4.26267988 4.27666612
 4.29045944 4.30406509 4.31748811 4.33073334 4.34380542 4.35670883
 4.36944785 4.38202663 4.39444915 4.40671925 4.41884061 4.4308168
 4.44265126 4.4543473  4.46590812 4.47733681 4.48863637 4.49980

That last cell is important - I'll explain a bit more about functions later, but all `np.log` is doing is taking in the array and taking the natural log of each element. Knowing which function achieves a desired effect, like the natural log, just comes from Googling how to do that task in Python.
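
As a rough illustration of "taking the log of each element", this sketch compares `np.log` with doing the same thing one element at a time using Python's built-in `math` module. The input values are made up so that the answers come out to roughly 0, 1, and 2.

```python
import math
import numpy as np

values = np.array([1.0, math.e, math.e ** 2])

# np.log applies the natural log to every element in one call
vectorized = np.log(values)

# The same result built one element at a time with the standard library
looped = np.array([math.log(v) for v in values])

print(vectorized)                       # approximately [0. 1. 2.]
print(np.allclose(vectorized, looped))  # True
```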

There are a few things we have left to do in the world of arrays. What if we wanted to make a new array of only even numbers? We could create a new array from scratch, Googling how to initialize it, or we could take our array of integers from (now) 1 to 100, and filter out the odds. To do so, we need to learn about boolean indexing. Booleans are a type in Python that can only be either `True` or `False`. For instance, `5 % 2 == 0`, which tests 5 against 2 using the modulus operator (so it returns the remainder when 5 is divided by 2), will return `False`, since `5 % 2` is 1, and `1 == 0` is not true. Note that here, `==` is an equality operator, not an assignment operator like `=`.
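
Here's that arithmetic on plain numbers, as a small sketch before we apply it to a whole array (`is_even` is just an illustrative name):

```python
print(5 % 2)       # 1, the remainder when 5 is divided by 2
print(5 % 2 == 0)  # False, because 1 is not equal to 0
print(6 % 2 == 0)  # True, because 6 divides evenly by 2

is_even = 6 % 2 == 0  # a single = stores the result of the == comparison
print(is_even)        # True
print(type(is_even))  # <class 'bool'>
```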

In [11]:
#boolean index array that shows where all the odds (falses) are:
print(random_list % 2 == 0)

#taking the boolean index array into the new array allows us to filter out the falses
even_list = random_list[random_list % 2 == 0]
print(even_list)

[False  True False  True False  True False  True False  True False  True
 False  True False  True False  True False  True False  True False  True
 False  True False  True False  True False  True False  True False  True
 False  True False  True False  True False  True False  True False  True
 False  True False  True False  True False  True False  True False  True
 False  True False  True False  True False  True False  True False  True
 False  True False  True False  True False  True False  True False  True
 False  True False  True False  True False  True False  True False  True
 False  True False  True]
[  2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36
  38  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72
  74  76  78  80  82  84  86  88  90  92  94  96  98 100]


The last thing we'll cover is `for` loops, which can be used in certain specific data science applications. Let's say we wanted to add the next 50 evens, to have our list of evens go from 2 to 200. We can do that with a `for` loop, which would just cycle through each even, and add it to our list. This is way faster than typing out each value manually, which is why it's preferred by programmers.

Note that in the following code segment, I'm using a function called `np.append` that takes in two parameters: `even_list` (the list I want to append to) and `2*i`, which is the value I want to append. This format of functions and parameters is something that will come up a lot later, so make sure you understand what a function is and how parameters are passed to it.

In [12]:
#for i in range just means for each value in the range from 51 to 101 (101 is exclusive)
for i in range(51,101):
    #we use the np.append function to append 2*i, which will range from 102 to 200 (evens only)
    even_list = np.append(arr=even_list, values=2*i)
    
print(even_list)

[  2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36
  38  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72
  74  76  78  80  82  84  86  88  90  92  94  96  98 100 102 104 106 108
 110 112 114 116 118 120 122 124 126 128 130 132 134 136 138 140 142 144
 146 148 150 152 154 156 158 160 162 164 166 168 170 172 174 176 178 180
 182 184 186 188 190 192 194 196 198 200]
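
For comparison, here's a sketch of another way to get the same 100 evens without a loop, using the three-argument form of `np.arange` (start, stop, step). The names `evens_direct` and `evens_loop` are made up for this example. It also rebuilds the loop version from scratch to check that both approaches agree.

```python
import numpy as np

# np.arange(start, stop, step): stop is exclusive, so 201 is needed to include 200
evens_direct = np.arange(2, 201, 2)

# Rebuilding the loop's result from scratch, for comparison
evens_loop = np.arange(2, 101, 2)
for i in range(51, 101):
    # np.append returns a new array, so the result must be assigned back
    evens_loop = np.append(arr=evens_loop, values=2 * i)

print(np.array_equal(evens_direct, evens_loop))  # True
print(len(evens_direct))                         # 100
```

The loop is still worth understanding, since not every task has a one-line shortcut like this one.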


Now we'll put everything together. Once you complete this mini-assignment, you should feel confident about your ability to learn about NumPy arrays. They are the foundation of Pandas dataframes, which is likely where you'll be spending most of your programming time in CDJ.

I've made a `test_list` array object that has a bunch of numbers. I want you to do the following:

 1. Add 100 to each number in the list
 2. Use [np.power()](https://numpy.org/doc/2.1/reference/generated/numpy.power.html) to take 10 to the power of each value in `test_list` (i.e. set 10 as the base and each value in `test_list` as the exponent)
 3. Use [np.round()](https://numpy.org/doc/2.1/reference/generated/numpy.round.html) to round each of the values in the list to the nearest whole number
 4. Print the first 5 elements of the list

The expected output won't be provided here, but you'll know if you did the steps right if you see a very recognizable pattern in your output.

In [36]:
#A3 your code here
test_list = np.array([ -99.52287875, -100.        ,  -99.39794001, -100.        ,
        -99.30103   ,  -99.04575749,  -99.69897   ,  -99.22184875,
        -99.30103   ,  -99.52287875,  -99.30103   ,  -99.09691001,
        -99.04575749,  -99.15490196,  -99.04575749,  -99.52287875,
        -99.69897   ,  -99.52287875,  -99.09691001,  -99.39794001,
        -99.22184875,  -99.69897   ,  -99.22184875,  -99.39794001])
test_list = test_list + 100            #step 1: add 100 to each number
test_list = np.power(10, test_list)    #step 2: 10 to the power of each value
test_list = np.round(test_list)        #step 3: round to the nearest whole number
print(test_list[:5])                   #step 4: print the first 5 elements

[3. 1. 4. 1. 5.]


### B: Dataframes and Pandas

If the above section seemed a little boring, that's fine: it kind of should be. When I started learning data science I felt the same way - I wanted to see a real world application. You'll soon see that in your own groups, but I can give you a taste of what's to come.

[Gapminder](https://www.gapminder.org/tools/#chart-type=bubbles&url=v2) is one of the most popular animated and interactive visuals. It's something that we'll recreate from scratch later in NME, but for now I'll just show you how to access the data, which we'll manipulate later.

You can download the data from [here](https://github.com/kirenz/datasets/blob/master/gapminder.csv).

In [38]:
gapminder = pd.read_csv('gapminder.csv')

gapminder

| | country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 |
| 2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 |
| 3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 |
| 4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 |
| ... | ... | ... | ... | ... | ... | ... |
| 1699 | Zimbabwe | Africa | 1987 | 62.351 | 9216418 | 706.157306 |
| 1700 | Zimbabwe | Africa | 1992 | 60.377 | 10704340 | 693.420786 |
| 1701 | Zimbabwe | Africa | 1997 | 46.809 | 11404948 | 792.449960 |
| 1702 | Zimbabwe | Africa | 2002 | 39.989 | 11926563 | 672.038623 |


We just imported a dataframe from a .csv file, which is probably what most of you will be doing in your project teams. A dataframe is like a 2-D list, with rows and columns. Importantly, each column is a `Series` object, which is very similar to a list, so we can perform some very similar operations like the ones we saw above.
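
To make the "each column is a `Series`" idea concrete, here's a minimal sketch on a tiny hand-built dataframe. The data and the name `toy` are made up for illustration and have nothing to do with `gapminder.csv`.

```python
import pandas as pd

# Hypothetical toy data, not taken from gapminder.csv
toy = pd.DataFrame({
    "city": ["A", "B", "C"],
    "population": [1000, 2500, 400],
})

population = toy["population"]            # a single column is a Series
print(isinstance(population, pd.Series))  # True

print(population[0])                 # 1000, indexing works like it did with arrays
print((population / 1000).tolist())  # [1.0, 2.5, 0.4], math applies to every element
print((population > 500).tolist())   # [True, True, False], comparisons do too
```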

Let's start with indexing dataframes. There's a bunch of ways to do this and I'll give a bunch of examples for you to reference:

`.loc` and `.iloc` are standard ways to access certain locations of data, either based on an integer position or a label. You can probably guess that `.iloc` uses an integer position, whereas `.loc` uses a boolean or label. Most of the time, you'll use `.loc`, but `.iloc` is still good to know.
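
The difference between the two is easiest to see when the row labels aren't just 0, 1, 2. Here's a sketch on a made-up dataframe (`toy` and its values are hypothetical) whose rows are labeled by year:

```python
import pandas as pd

# Hypothetical toy data whose row labels are years rather than 0, 1, 2
toy = pd.DataFrame(
    {"value": [10, 20, 30]},
    index=[1952, 1957, 1962],
)

print(toy.iloc[0]["value"])    # 10, the row in position 0
print(toy.loc[1952, "value"])  # 10, the row labeled 1952

# Slices differ too: .iloc excludes the end position, .loc includes the end label
print(toy.iloc[0:2]["value"].tolist())       # [10, 20]
print(toy.loc[1952:1962]["value"].tolist())  # [10, 20, 30]
```

With this index, `toy.loc[0]` would raise a `KeyError`, because no row is labeled 0.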

In [15]:
#this prints the data in the row at index number 1
print(gapminder.iloc[1])

country      Afghanistan
continent           Asia
year                1957
lifeExp           30.332
pop              9240934
gdpPercap      820.85303
Name: 1, dtype: object


In [16]:
#since the index in our dataframe is also integers, this also works with .loc
print(gapminder.loc[1])

country      Afghanistan
continent           Asia
year                1957
lifeExp           30.332
pop              9240934
gdpPercap      820.85303
Name: 1, dtype: object


In [17]:
#prints the data in the first 5 rows
print(gapminder.head(5))

       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106


In [18]:
#prints the data in the first 5 rows
print(gapminder.iloc[0:5])

       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106


In [19]:
#prints the data in the column 'continent'
print(gapminder.loc[:,'continent'])

0         Asia
1         Asia
2         Asia
3         Asia
4         Asia
         ...  
1699    Africa
1700    Africa
1701    Africa
1702    Africa
1703    Africa
Name: continent, Length: 1704, dtype: object


In [20]:
#prints the data in the column 'continent'
print(gapminder['continent'])

0         Asia
1         Asia
2         Asia
3         Asia
4         Asia
         ...  
1699    Africa
1700    Africa
1701    Africa
1702    Africa
1703    Africa
Name: continent, Length: 1704, dtype: object


In [21]:
#prints the data in the columns 'continent' and 'country' 
print(gapminder[['continent','country']])

     continent      country
0         Asia  Afghanistan
1         Asia  Afghanistan
2         Asia  Afghanistan
3         Asia  Afghanistan
4         Asia  Afghanistan
...        ...          ...
1699    Africa     Zimbabwe
1700    Africa     Zimbabwe
1701    Africa     Zimbabwe
1702    Africa     Zimbabwe
1703    Africa     Zimbabwe

[1704 rows x 2 columns]


We could keep going about these types of arrangements, but the point is that there's a bunch of ways to do the same thing in Pandas. Just choose whatever works and makes sense to you.

The last thing we'll look at in this part of NME is filtering dataframes. There's another way to do this that's much easier, but it's good to know how to do it in Python (I'll teach you the easy way later). Just like the boolean indexing we did with arrays, the same ideas apply here, since dataframes are basically just combined lists. The data we loaded starts at 1952 and only goes up to 2007. Let's say we only cared about data from 1980 to 2000. We filter out the unneeded data in a familiar style:

In [22]:
filtered_gapminder = gapminder[(gapminder['year'] >= 1980) & (gapminder['year'] <= 2000)]

filtered_gapminder

| | country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|---|
| 6 | Afghanistan | Asia | 1982 | 39.854 | 12881816 | 978.011439 |
| 7 | Afghanistan | Asia | 1987 | 40.822 | 13867957 | 852.395945 |
| 8 | Afghanistan | Asia | 1992 | 41.674 | 16317921 | 649.341395 |
| 9 | Afghanistan | Asia | 1997 | 41.763 | 22227415 | 635.341351 |
| 18 | Albania | Europe | 1982 | 70.420 | 2780097 | 3630.880722 |
| ... | ... | ... | ... | ... | ... | ... |
| 1689 | Zambia | Africa | 1997 | 40.238 | 9417789 | 1071.353818 |
| 1698 | Zimbabwe | Africa | 1982 | 60.363 | 7636524 | 788.855041 |
| 1699 | Zimbabwe | Africa | 1987 | 62.351 | 9216418 | 706.157306 |
| 1700 | Zimbabwe | Africa | 1992 | 60.377 | 10704340 | 693.420786 |


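Note that the parentheses are very important here - try removing them and see what happens. This occurs because Python gives the `&` operator higher precedence than the `>=` and `<=` comparisons, so without parentheses it evaluates `1980 & gapminder['year']` first and misinterprets what we actually want. Also, that `&` is just an element-wise 'and': a row is kept only when both conditions are `True`.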
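
If you'd rather see what happens without breaking your own notebook, here's a small sketch on a made-up `Series` of years (the name `years` and its values are hypothetical):

```python
import pandas as pd

years = pd.Series([1975, 1985, 1995, 2005])  # hypothetical values

# With parentheses: each comparison runs first, then & combines them row by row
mask = (years >= 1980) & (years <= 2000)
print(mask.tolist())         # [False, True, True, False]
print(years[mask].tolist())  # [1985, 1995]

# Without parentheses, Python groups the expression as
#   years >= (1980 & years) <= 2000
# which is a chained comparison that pandas cannot reduce to one True/False
try:
    years >= 1980 & years <= 2000
    print("no error")
except ValueError as error:
    print("ValueError:", error)
```

The error message you get mentions that the truth value of a Series is ambiguous, which is a good phrase to Google if you ever run into it.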

Now it's your turn: I want you to do the following:

1. Filter the `gapminder` dataset to just include the data for 'United States', put this into a new dataframe called `great_gapminder`
2. Filter the `great_gapminder` dataset to just include the data for years that end in 7 (remember to re-assign it to `great_gapminder`)
   1. This might be kinda tricky - think about how we might use the `%` operator
3. Print `great_gapminder` to show just the `year` and `gdpPercap` columns

In [49]:
#B1: your code here
great_gapminder = gapminder[(gapminder['country'] == 'United States')]
#print (great_gapminder)

great_gapminder = great_gapminder[(great_gapminder['year']%10==7)]
#print (great_gapminder)

print(great_gapminder[['year','gdpPercap']])



      year    gdpPercap
1609  1957  14847.12712
1611  1967  19530.36557
1613  1977  24072.63213
1615  1987  29884.35041
1617  1997  35767.43303
1619  2007  42951.65309


Expected Output:

```
      year    gdpPercap
1609  1957  14847.12712
1611  1967  19530.36557
1613  1977  24072.63213
1615  1987  29884.35041
1617  1997  35767.43303
1619  2007  42951.65309
```

That's pretty much the basics of indexing dataframes and filtering them using Pandas. Obviously there's way more stuff that you could do and way more things for me to teach, but it's really not practical: nobody knows NumPy and Pandas that well - documentation exists for a reason, and Google is your friend. If you want to try to do something in Python, just Google how to do that skill and there'll be thousands of people that have had that same exact desire in the past.

The point of this workbook is just to make sure that you understand *what* you're actually Googling, and that you're not Googling everything.