# Use Data Science and APIs to Support Your Collections Work

# Part I: Getting Started

Welcome to this workshop on using basic data science skills, and data from the tools that many of our libraries already subscribe to, to support your collections work!

### The plan
We're going to work with Python and do this entire workshop within this Jupyter Notebook. The intended audience is librarians with no programming experience. In this three-hour workshop, we will:
* [set up a Jupyter Notebook environment and get oriented](#Jupyter-Notebook) (Part I)
* [take a crash course on programming in Python](#A-crash-course-in-Python) (Part I)
* query the Scopus API to get data, and import it into pandas (Part II)
* analyze and manipulate data using pandas (Part II)
* use Python to access a library link resolver (Part II)

### Acknowledgments, credits and copyright
This workshop was developed by Roger Reka (University of Windsor) for the 2019 Canadian Library Assessment Workshop. This material is licensed under a [Creative Commons CC-BY 4.0 license](https://creativecommons.org/licenses/by/4.0/); you are free to re-use this material and/or adapt it for your own needs.

Some of the material in this workshop, specifically for the Python introduction section, is adapted from the Data Carpentry [*Data Analysis and Visualization in Python for Ecologists*](https://datacarpentry.org/python-ecology-lesson/) lesson, which is also licensed with a CC-BY license.

This workshop is loosely based on an ongoing project of mine to create tools to automate collections assessment. Find the [project on GitHub](https://github.com/rreka/Automating-collections-assessment).

We are making use of bibliographic data from Elsevier's Scopus database. We will access the Scopus database using an API, in accordance with the terms and conditions of the API service. All data remains the property of Elsevier.

Finally, we will be accessing the Scopus API using a [Python module called pybliometrics](https://pybliometrics.readthedocs.io/en/stable/) developed by Michael E. Rose and John R. Kitchin. pybliometrics makes accessing the API much easier in Python, and we thank Michael and John for developing this helpful tool.

***
## Jupyter Notebook

Jupyter Notebook is an **integrated development environment** (IDE) for Python and some other programming languages. While Python can be used straight from the terminal or command line, using an IDE gives us a working space with all the tools we need to write code. Jupyter Notebook is especially helpful because it visualizes data nicely for us, and allows us to add text to the code to make it easier to read in a workshop.

There are many other IDEs for Python, and some other ones are included in the Anaconda distribution that you downloaded.

### Cells
Cells are the fields where we add our code or our text in Jupyter Notebook.

In [None]:
# This is a cell in a Jupyter Notebook with a comment

#### Running cells
In order to run the code in a cell, simply select the cell and click on the **Run** button. 

Alternatively, you can use the shortcut `Control + Enter` or `Shift + Enter`.

<img src="img/run-cell.png"/>

**You try!** Run the code in the cell below.

In [None]:
print('Hello world!')

#### Stop the cell 
Sometimes you may need to stop the code that you're executing. You may accidentally start an infinite loop, or run some code that takes too long to process. Simply press the **Stop** button to stop the code.

Next to each cell there is a field `In [ ]:`. When Python has successfully executed the code in the cell, a number appears within the square brackets indicating the order in which the cell was executed. While Python is still processing the code, a `*` appears in the square brackets.

<img src="img/stop-cell.png"/>

**You try!** The code in the cell below is an infinite loop. Run the cell and then stop the cell.

In [None]:
n = 1
while True:
    n = n + 1
    print('hello' + str(n))

***

## A crash-course in Python

### What is Python?

Python is a high-level, general-purpose programming language. It is widely adopted in software development, data analysis and, increasingly, research. Python is one of the easiest programming languages to pick up and start learning, and has relatively low barriers to entry.

Python is: 
* Free
* Open-source
* Widely adopted with a large user community
    * Lots of third-party packages
* Supported on all operating systems

### Getting help
There are built-in features within Python to seek help as you're working.

You can access the Python help menu by typing `help()`.

In [None]:
help()

You can also get specific information about an object or function by typing `?object` or `help(object)`.

In [None]:
help(print)

If you get stuck on something in Python, a general web search will often lead you to helpful information. Documentation for Python and its modules, as well as questions asked by other coders, can be found online. Just search for **"Python <whatever task you're trying to do>"**.

Many answers can be found online on internet forums such as Stack Overflow.

### Learning more about Python
You can learn more about programming in Python by attending a Software Carpentry workshop near you, or by taking an online course. There are many online courses to choose from, but I recommend Codecademy and DataCamp. Both offer free plans.

### Python data types
Everything in Python has a type. Understanding the type of data is important for working in Python.

#### Numbers
The two most common types of numbers in Python are `int` and `float`. An `int` is a whole number (an integer), while a `float` is a number with a decimal point.

In [None]:
13 # this is an int

In [None]:
3.14159 # is a float

#### Strings
Strings hold sequences of characters, which can be letters, numbers, punctuation or other forms of text. Strings must be bounded by quotation marks `'` or `"`. A string is of type `str`.

In [None]:
'Hello, this is a string' # a string

In [None]:
"Me 2, I am also a string" # a string that contains a number as a string.

#### Booleans
`True` and `False` are boolean data types. 

In [None]:
True # this is not a string, but rather a boolean type

In [None]:
False # this also is not a string; this is a boolean

### Mathematical operations
We can use Python as a simple calculator using the basic operators that we're all familiar with.
<br>
`+`: addition
<br>
`-`: subtraction 
<br>
`/`: division
<br>
`*`: multiplication
<br>
`**`: exponentiation

In [None]:
2 + 2 # addition

In [None]:
2 - 2 # subtraction

In [None]:
41 / 22 # division

In [None]:
1121 * 238 # multiplication

In [None]:
2 ** 6 # exponentiation

Python respects the order of operations (BEDMAS).

In [None]:
(10 - 5) * 3 + 5
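
To see why the brackets matter, here is a small sketch that evaluates the same numbers with and without them. It also shows that `/` gives a `float` even when the answer is a whole number.

```python
# Brackets are evaluated first: (10 - 5) is 5, then 5 * 3 + 5
print((10 - 5) * 3 + 5)

# Without brackets, multiplication comes first: 10 - 15 + 5
print(10 - 5 * 3 + 5)

# Division with / produces a float
print(10 / 5)
```

The first two lines print `20` and `0`, and the last prints `2.0`.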

***
<div class="alert alert-block alert-success">
<b>Challenge 1</b> 
</div>

1. Practice working in Python by using it as a calculator and doing some basic math.
2. Combine the following strings together: `'hello'` and `'world'` using the math operators.
3. So far we've been using the math operators on data of the same type. What happens when you use these operators on data of **different** types (e.g., `str` and `int`, or `int` and `float`)?

***

### Comparison and logical operators
We can also use comparison and logical operators on our data. This will return a boolean.

#### Comparison operators
`<`: less than
<br>
`>`: greater than
<br>
`==`: equal
<br>
`!=`: not equal
<br>
`<=`: less than or equal to
<br>
`>=`: greater than or equal to

#### Logical operators
`and`: true only when both items are true
<br>
`or`: true when at least one of the items is true
<br>
`not`: reverses the boolean that follows it

In [None]:
3 > 4

In [None]:
3 < 4

In [None]:
3 == 4

In [None]:
3 != 4

In [None]:
(10 + 5) <= 30

In [None]:
(6-3) < (9+10)

In [None]:
True and True

In [None]:
True or False

In [None]:
True and False

In [None]:
(10 < 3) and (6 > 2)
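
The cells above cover `and` and `or`. The `not` operator can be tried the same way; the sketch below is one possible set of examples, not the only way to use it.

```python
# not flips a boolean to its opposite
print(not True)

# 3 > 4 is False, so not (3 > 4) is True
print(not (3 > 4))

# not can be combined with and / or
print((3 < 4) and not (10 < 3))
```

These print `False`, `True` and `True`.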

### Variables & objects
Inputting our data straight into Python is useful, but this is not the way that we usually do things in Python. We will almost always create objects and **assign** the data to them. We do so by using the assignment operator `=`. Note that the assignment operator is different from the equality operator `==`.

Variable names can contain letters, digits and underscores, but cannot begin with a digit. There can be no spaces in a variable name, so we can use underscores or title case when we have multiple words in a variable name.

The variable is the text name of the object. The object is the data itself.

In [None]:
some_text = "This is my text!" # assigning some string data to the variable 'some_text'

Notice that after we assign data to a variable, there is no output from Python. We must call the variable again to see the data. We can do this in two ways: in a notebook we can simply write the name of the variable, but when working within scripts we must `print()` the variable to see the data.

In [None]:
some_text # call the variable 

In [None]:
print(some_text) # using the print() function to print out the data in the variable.

In [None]:
number = 40 # create & assign a variable
number # call the variable

In [None]:
pi_value = 3.14159 # assigning a float to a variable

We check to see what type the variable is by using the built-in function `type()`.

In [None]:
type(some_text)

In [None]:
type(number)

In [None]:
type(pi_value)

We can also use mathematical operators on variables, just like we did with just the data earlier.

In [None]:
number + pi_value

In [None]:
some_text + ' I make the best sentences.'

In [None]:
number + some_text # this raises a TypeError: Python cannot add an int and a str
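
Adding an `int` to a `str` with `+` raises a `TypeError`, because Python will not guess which type we want. One way around this is to convert the data first with the built-in functions `str()` and `int()`. A minimal sketch, in which the `combined` and `year` variables are made up for illustration:

```python
number = 40
some_text = "This is my text!"

# str() turns the int into a string, so + can join the two strings
combined = str(number) + ' ' + some_text
print(combined)

# int() goes the other way: it turns a string of digits into a number
year = int('2019') + 1
print(year)
```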

### Data structures: lists, tuples, and dictionaries
We often want to store a sequence of data, as opposed to a single piece of data. Lists, tuples and dictionaries are some of the most common data structures in Python.

#### Lists
Lists store ordered data. The items in a list are usually of the same type, although Python does not require this. Lists can be recognized by their square brackets `[ ]`.

In [None]:
empty_list = [] # creating an empty list
empty_list

In [None]:
type(empty_list)

In [None]:
numbers = [1, 2, 3, 4]
numbers

In [None]:
letters = ['a', 'b', 'c', 'd', 'e']
letters

We can even create a list of lists.

In [None]:
many_lists = [[2, 4, 6], [1, 2, 3]]
many_lists

We can **subset** specific data from a list by knowing its **index**, or place in the list. In Python, indexes begin at 0. Negative indexes count backwards from the end of the list, so -1 refers to the last item.

` ['a', 'b', 'c', 'd', 'e']
   0    1    2    3    4
`

In order to subset a piece of data, we call the variable name, add square brackets next to the name, and enter the index of the data we're interested in.

In [None]:
letters[0]

In [None]:
letters[4]

In [None]:
letters[-1]

In [None]:
letters[0:2]

In [None]:
letters[2:]

In [None]:
many_lists[1][2]
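
Subsetting with a colon is called slicing. The sketch below spells out what a few slices return; the variable names on the left are made up for illustration.

```python
letters = ['a', 'b', 'c', 'd', 'e']

# start:stop includes the start index but stops just before the stop index
first_two = letters[0:2]

# negative indexes count back from the end of the list
last_two = letters[-2:]

# leaving out the start means "from the beginning"
up_to_three = letters[:3]

print(first_two)
print(last_two)
print(up_to_three)
```

Note that `letters[0:2]` returns two items, not three, because the item at index 2 is left out.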

We can add elements to the end of the list using the `append` method. A method is a function that belongs to an object, such as a list. We call the method by using a period `.` followed by the method name and the arguments in the parentheses.

In [None]:
numbers

In [None]:
numbers.append(5)

In [None]:
numbers

A helpful function to use with lists, tuples and dictionaries is `len()`. Use `len()` to describe the length of a sequence, such as lists.

In [None]:
len(numbers)
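
`len()` is not limited to lists. As a quick sketch, here it is applied to a string, a tuple and a dictionary (the example values are made up):

```python
# number of characters in a string
print(len('hello'))

# number of items in a tuple
print(len(('red', 'green', 'blue')))

# number of key-value pairs in a dictionary
print(len({'Gordon': 'West', 'Faye': 'Philip'}))
```

These print `5`, `3` and `2`.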

We can also get to all of the data in a list by creating a `for` loop. A `for` loop is one of the more common programming tasks that we do in Python, and we'll use it often in our work. A `for` loop iterates over every item in the sequence.

In [None]:
for number in numbers:
    print(number)

In [None]:
for n in numbers:
    print(n + 10)

#### Tuples
Tuples are also ordered sequences of data, like lists, but they can't be changed once created.

Tuples can be recognized by their parentheses `( )`.

In [None]:
colours = ('red', 'green', 'blue', 'yellow')
colours

In [None]:
colours[1]

In [None]:
for colour in colours:
    print(colour + ' is a nice colour.' )

You can also subset tuples, like lists, by passing the index of the data that you'd like to retrieve in square brackets `[ ]`.

***
<div class="alert alert-block alert-success">
<b>Challenge 2</b> 
</div>

1. Create a list with at least 8 numbers.
2. Use a `for` loop to print out all the numbers in the list.
3. Create a tuple that contains both strings and numbers.
4. What happens when you execute `numbers[1] = 100`?
5. What happens when you execute `colours[0] = 'pink'`?
***

#### Dictionaries
Dictionaries are collections of data arranged in **key-value pairs**. Unlike lists and tuples, items in a dictionary are looked up by their key rather than by their position.

In [None]:
names = {
    'Gordon': 'West',
    'Saanvi': 'Roy',
    'Faye': 'Philip',
    'Israa': 'Faisal'
}

In [None]:
names

In [None]:
names[3] # this raises a KeyError: dictionary items are looked up by key, not by position

In [None]:
names['Israa']

We can add an item to a dictionary by specifying the new key and its value.

In [None]:
names['Precious'] = 'Navarro'

In [None]:
names
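
We can also loop over a dictionary. The sketch below uses the `.items()` method, which hands back each key together with its value, and the `in` operator, which checks whether a key exists. It uses a shortened copy of the `names` dictionary, and the loop variable names are made up for illustration.

```python
names = {
    'Gordon': 'West',
    'Saanvi': 'Roy',
}

# .items() gives each key together with its value
for first_name, last_name in names.items():
    print(first_name + ' ' + last_name)

# 'in' checks whether a key exists before we look it up
print('Gordon' in names)
print('Roger' in names)
```

Checking with `in` first is a simple way to avoid a `KeyError` when a key might be missing.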

### Compound statements
So far we've been working with individual statements, running them in one-off situations, such as:

`print('hello world')`

**Compound statements** contain one or more statements, and they control the execution of those other statements in some way. Usually, compound statements span multiple lines (though some can be contained in one line). The `if`, `while` and `for` statements are the traditional control compound statements that you'll encounter.

#### `if` statements
The `if` statement is used for conditional statements, such as:

In [None]:
x = 10 # assign a smaller number to x
y = 20 # assign a bigger number to y

if x < y: # if this condition is true...
    print(x) # do this action

Python evaluates the conditions one by one, from top to bottom, and runs the code under the first one that is true. We can add more conditional statements with `elif` and `else`.

`elif` is short for "else if", and is used to pass another conditional.

`else` is used at the end of a chain of conditionals, to express what must happen if every other condition is false.

In [None]:
x = 50

if x < 10:
    print('The number is less than 10')
    
elif 10 <= x < 40:
    print('The number is between 10 and 40')

else:
    print('The number is greater than or equal to 40')
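
To check how a chain of conditions treats values at the edges, it helps to run it over several test values. A minimal sketch follows; the test values, the labels and the `results` list are made up for illustration.

```python
results = []

for x in [5, 10, 25, 40]:
    if x < 10:
        label = 'less than 10'
    elif x < 40:
        # we only get here if x < 10 was False, so x is at least 10
        label = 'between 10 and 39'
    else:
        label = '40 or more'
    print(x, label)
    results.append(label)
```

Because the conditions are checked in order, the `elif` does not need to repeat the test that `x` is 10 or more.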

#### `while` statements
The `while` statement is used for repeated execution as long as an expression is true.

In [None]:
x = 0

while x < 15:
    x = x + 1
    print(x)

#### `for` statements
The `for` statement is used to iterate over the elements of a sequence (such as a string, tuple or list) or other iterable object.

In [None]:
numbers

In [None]:
for n in numbers:
    print(n)
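
A `for` loop becomes more useful when it builds something up as it goes. The sketch below keeps a running total of a list, and then loops over the characters of a string. The `total` variable and the example values are made up for illustration.

```python
numbers = [1, 2, 3, 4, 5]

# start at zero and add each number in turn
total = 0
for n in numbers:
    total = total + n
print(total)

# a string is also a sequence, so we can loop over its characters
for letter in 'cat':
    print(letter)
```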

***
<div class="alert alert-block alert-success">
<b>Challenge 3</b> 
</div>

Another math operator that we didn't mention before is the modulo operator `%`. Modulo is used to get the remainder of a division. For example:

<code>>>> 4%2 
0</code>

<code>>>> 5%2 
1</code>


1. Fill out the code below to create a program that can decide if the numbers in the `collection` list are even or odd. Hint: you'll need to use the modulo operator, and use two of the conditional statements above!

In [None]:
collection = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [None]:
for _______ in __________:
    ___________________:
        print(str(___) + ' is an even number')
    _____:
        print(str(___) + ' is an odd number')

***

### Functions
For the most part, we've been writing the code manually so far. Functions are pieces of code that are saved and given a name, so that we can use them again later. Many functions come built-in with Python, but we can get other functions by downloading third-party modules, or even by defining them ourselves (more on that later).

Functions contain three main parts:

> **Function name**: The name of the function, so that we know how to call it. For example, we've been using the `print()` and `help()` functions. The names of these functions are 'print' and 'help', respectively.

> **Arguments**: Arguments are the values that we give the function in order to tell it how to work. Arguments go into the parentheses `( )` of the function. Some functions require no arguments, while others require one or more.
<br>
To find out what the arguments you need to pass the function are, you can look at the help documentation for the function.

> **Return value**: Functions return some sort of output, though it may not always be printed to the screen; it may be stored in memory.

Let's take a closer look at the `print()` function that we've been using. Obviously, the function name of `print()` is 'print'. We call the function by writing this name followed by parentheses.

What arguments can `print()` take?

In [None]:
?print 

The documentation for the `print()` function provides us with this information about the arguments. The arguments are separated by commas:

`print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)`

> `value` is the value that we're trying to print

> `...` means that we can repeat the preceding argument as needed. In this case, we can print more than one value as long as we separate them by commas.

> `sep=' '` is the string that is inserted between values; in this case, the default is a space character.

And so on...


In [None]:
print('This', 'sentence', 'will', 'be', 'separated', 'by', 'underscores', sep="_")
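
The `end` argument can be tried in the same way. By default `print()` finishes with a new line (`'\n'`); the sketch below changes that so two calls share one line, and then uses a different `sep`.

```python
# end=' ' replaces the usual new line with a space
print('Same', end=' ')
print('line')

# sep can be any string, including more than one character
print('a', 'b', 'c', sep=', ')
```

This prints `Same line` on one line, followed by `a, b, c`.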

#### Methods
Methods are a type of function. They differ from regular functions because they are associated with a particular object. A method is 'tacked on' to the end of the object it operates on. Methods can also take arguments in the parentheses.

We've already used methods earlier when we used the `.append()` method on a list.

In [None]:
fruit = ['apple', 'banana', 'cherry']

In [None]:
fruit.append('orange')

In [None]:
fruit

There are [many other methods](https://www.w3schools.com/python/python_ref_list.asp) that we can use with lists, that will be helpful for data analysis later on.

In [None]:
copy_of_fruit = fruit.copy()

In [None]:
copy_of_fruit

In [None]:
fruit.extend(copy_of_fruit)

In [None]:
fruit

In [None]:
fruit.count('banana')

In [None]:
fruit.index('cherry')

In [None]:
fruit.pop(2)

In [None]:
fruit

In [None]:
fruit.remove('apple')

In [None]:
fruit

In [None]:
fruit.clear()

In [None]:
fruit
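
Lists are not the only objects with methods. Strings have their own, which come in handy when cleaning up text such as titles. A brief sketch, using a made-up title:

```python
title = 'journal of library assessment'  # made-up example title

# .upper() and .title() return a new string with different capitalization
print(title.upper())
print(title.title())

# .split() breaks the string into a list of words
words = title.split(' ')
print(words)

# .replace() swaps one piece of text for another
print(title.replace('journal', 'review'))
```

None of these methods change `title` itself; each one returns a new value.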

### User-defined functions
In addition to functions and methods that are built-in or that come from third-party modules, we will at times want to define our own functions. We do this when we write some code that we want to re-use over and over.

A user-defined function contains the same three main parts as other functions: a **function name**, **arguments** and a **return value**.

To create a user-defined function, we begin with the `def` statement. After the `def` statement, we write our function's name, and then in parentheses, we specify the arguments, followed by a `:`.

The body of the function contains the code that we are saving in the function. We usually end the code by returning some value with `return`; a function without a `return` statement returns the special value `None`.

Let's put this into practice by writing a function that takes two numbers, and adds them together.

In [None]:
def my_calculator(x, y): 
    answer = x + y   # add two numbers (x, y) and assign the sum to the 'answer' variable
    return answer    # return 'answer'

In [None]:
print(my_calculator(20, 43))
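
Returning a value is different from printing it. A returned value can be stored in a variable and used again, while a function that only prints hands back `None`. The sketch below repeats `my_calculator` and adds a hypothetical `say_hello` function for comparison.

```python
def my_calculator(x, y):
    answer = x + y
    return answer

# the returned value can be stored and used in further calculations
result = my_calculator(20, 43)
print(result + 1)

# arguments can also be passed by name
print(my_calculator(x=1.5, y=2))

# say_hello is a made-up function that prints but has no return statement
def say_hello(name):
    print('Hello ' + name)

nothing = say_hello('Roger')
print(nothing)
```

The last line prints `None`, because `say_hello` never returns anything.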

***
<div class="alert alert-block alert-success">
<b>Challenge 4</b> 
</div>

1. Write a function that finds the max of two numbers and returns the max number. Call that function in a new cell to show that it works.

***