# Welcome to Python & Jupyter Notebook Basics

In this part, we will get familiar with the **Jupyter notebook** and the **Python** programming language.

In [None]:
"Hi! Welcome to scraping with Python."

<br>➡  This is a Jupyter notebook cell
<br>➡  Each cell needs to be **run** after you write your code: click the `▶` button above, or press `Shift+Enter` or `Ctrl+Enter`
<br>➡  If you see a number between the brackets next to the cell, e.g. `[23]`, the cell **has been run**
<br>➡  If you see empty brackets `[ ]`, the cell has **not been run**
<br>➡  If you see `[*]`, the cell is **running**. Other cells you run in the meantime will wait until it has finished
<br>➡  **Important**: if you change a cell, you need to **run it again**!

#### Add new cells:
<br>➡ with the `+` sign in the top menu
<br>➡ by pressing ESC and then `a` (above) or `b` (below)


In [None]:
# this is a comment

<br>➡ put a **comment** in your code using the hashtag `#`
<br>➡  Everything after the hashtag won't be read by Python: `# This is a comment`

# Example scrapers

## Get info from multiple pages

In [None]:
import requests
from bs4 import BeautifulSoup as bs
import pandas

data = [] # creating an empty list - this will hold the data
print(f'--> This is now the content of data: {data}') # you can call variables in the string using {} and prefixing the string with f''

for n in range(1,6): # the range of pages we want the scraper to get
    print('\n') # this prints out an empty line
    print(f'--> We are on page number: {n}')
    URL = "https://www.theguardian.com/news/series/todayinfocus?page=" + str(n)
    # amend the URL for each page. The number has to be converted into a string before it can be combined with the URL!
    print(f'--> The url we are scraping now is: {URL}')

    website_request = requests.get(URL) # request the webpage
    print(f'--> This is what the request looks like: {website_request}. If you got a "Response 200" you are good!')
    website_content = website_request.text # get the contents out of the webpage.
    print(f'--> The first 763 characters of the website content: {website_content[:763]}')
    website_read = bs(website_content, "html.parser") # read the contents of the webpage, naming the parser explicitly to avoid a warning
    
    # Let's get all the headlines from the website!
    
    headline_class = "span.js-headline-text" # CSS selector for the elements holding the headlines
    headlines = website_read.select(headline_class) # select all the headlines
    
    for h in headlines: # for each headline element
        print('\n') # this prints out an empty line
        print(f'--> We are now at the headline: {h}')
        h = h.text # get the text out of the element
        print(f'--> Without the HTML around it looks like this: {h}')
        data.append(h)
        print(f'--> This is now the content of data: {data}')

pandas.DataFrame(data).to_csv("Today_in_focus-headlines-5-pages.csv")

### ... this is how the code looks without all the comments and print statements

## Get more detailed info from one page

In [None]:
import requests
from bs4 import BeautifulSoup as bs
import pandas

data = [] # creating an empty list - this will hold the data
print(f'--> This is now the content of data: {data}') # you can call variables in the string using {} and prefixing the string with f''

URL = "https://www.theguardian.com/news/series/todayinfocus"
print(f'--> The url we are scraping now is: {URL}')

website_request = requests.get(URL) # request the webpage
print(f'--> This is what the request looks like: {website_request}. If you got a "Response 200" you are good!')
website_content = website_request.text # get the contents out of the webpage.
print(f'--> The first 763 characters of the website content: {website_content[:763]}')
website_read = bs(website_content, "html.parser") # read the contents of the webpage, naming the parser explicitly to avoid a warning

articles_details = website_read.select(".fc-item__content")
print(f'--> There are {len(articles_details)} articles on the page.')

n = 0

for a in articles_details:
    article = {} # creating an empty dictionary - this will hold the details
    #print(a)
    print("\n")
    n = n+1
    print(f"--> This is article number {n}")
    print(f'--> This is now the content of "article": {article}')
    
    article["title"] = a.select(".fc-item__title")[0].text # get the article's title
    article["link"] = a.select(".fc-item__link")[0].get("href") # get the article's link
    article["intro"] = a.select(".fc-item__standfirst")[0].text.strip() # get the article's intro
    
    print(f'--> This is now the content of "article": {article}')
    
    data.append(article) # put the article in the bigger list
    print(f'--> This is now the content of data: {data}')
    
pandas.DataFrame(data).to_csv("Today_in_focus-first-page.csv")

### ... this is how the code looks without all the comments and print statements

# THE BASICS

## **Variables**

![variables](../img/variables.png)

<br>➡  `variables` = a kind of labelled box in which you can store numbers, names, expressions, and even other variables
<br>➡  the name of a variable is arbitrary, but it helps if the name tells you what it stands for.
<br>➡  create a variable with `variable_name =`
<br>➡  a variable name can only contain **lowercase and uppercase letters, numbers, and underscores**, and it cannot start with a number. **No spaces** or other funky characters!
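
As a small illustration of the points above, here is a sketch that creates a few variables and re-assigns one of them. The variable names and values are made up for this example.

```python
# Hypothetical variable names, chosen only for illustration
podcast_name = "Today in Focus"   # text
pages_to_scrape = 5               # a whole number
seconds_to_wait = 1.5             # a number with a decimal point

print(podcast_name)
print(pages_to_scrape)

# A variable can be re-assigned: the label now sticks to the new value
pages_to_scrape = 10
print(pages_to_scrape)
```

A descriptive name such as `pages_to_scrape` is easier to understand later than a short name such as `p`.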

### **Strings**

<br>➡ `string` = text
<br>➡ put it in between single or double quotes. `"text"` or `'text'`

What happens if you don't put the text between quotes?

<br>➡  you can use the `print()` function to view the content of a variable: `print(variable_name)`
<br>➡  or, in a Jupyter notebook, just type the variable name on the last line of a cell

Print your string. Use `print()`
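
One possible way to do this is sketched below. The variable name `greeting` is only an example; any string variable works the same way.

```python
greeting = "Hi! Welcome to scraping with Python."  # example string

# print() displays the text itself, without the quotes
print(greeting)

# In a notebook, a bare variable name on the LAST line of a cell is shown too.
# Strings displayed this way keep their quotes.
greeting
```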

<br>➡ You can also do **addition** and **multiplication** with strings

<br>➡ there are also things you cannot do with strings, such as adding a string and a number
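
The sketch below tries out addition and multiplication with strings, and then one operation that fails. The strings are invented for the example, and the `try` / `except` block is only there so the cell keeps running after the error.

```python
word = "sheep"

# Addition glues strings together
print("1 " + word)        # 1 sheep

# Multiplication repeats a string
print("ha" * 3)           # hahaha

# Adding a string and a number does not work
try:
    print("page=" + 10)
except TypeError as error:
    print(f"TypeError: {error}")

# Converting the number into a string first solves it
print("page=" + str(10))  # page=10
```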

### **Numbers**

<br>➡ `Integer` : whole number, such as `5`, `6` or `2454`

<br>➡ `Float` : a number with a decimal point, such as `5.67` or `823.12`

<br>➡ You can do `addition`, `subtraction`, `multiplication`, `division` (and more) with numbers.
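
A short sketch of these operations, using made-up numbers:

```python
whole = 7       # integer
decimal = 2.5   # float

print(whole + 3)        # 10
print(whole - 3)        # 4
print(whole * 3)        # 21
print(whole / 2)        # 3.5  (division always gives a float)
print(whole // 2)       # 3    (floor division drops the remainder)
print(whole % 2)        # 1    (the remainder)
print(whole + decimal)  # 9.5  (mixing an integer and a float gives a float)
```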

### Collection: **Lists**

<br>➡ A `list` is a **collection** of values **separated by commas**
<br>➡  Uses **brackets** `[]`
<br>➡ `empty_list = []`
<br>➡ `fibonacci = [1, 1, 2, 3, 5, 8, 13, 21, 34]`
<br>➡ `weird_words = ["anguilliform", "borborygmus", "cybersquatting", "logomachy", "winebibber", "rumpot" ,"studmuffin"]`
<br> [more weird words](https://www.lexico.com/explore/weird-and-wonderful-words)
<br><br>Let's make a list!

Let's make a list of numbers
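
For instance, a list of numbers could look like the sketch below. The names `page_numbers` and `mixed` are just examples.

```python
empty_list = []
page_numbers = [1, 2, 3, 4, 5]     # a list of integers
mixed = ["rumpot", 7, 2.5]         # a list can mix strings and numbers

print(page_numbers)
print(len(page_numbers))   # 5
print(len(empty_list))     # 0
```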

### Collection: **Dictionaries**

<br>➡ A `dictionary` is a **collection** of **key** and **value** pairs
<br>➡ Uses **curly brackets** `{}`
<br>➡ `empty_dictionary = {}`
<br> Let's make a dictionary for our weird words list

In [50]:
weird_words = {"anguilliform" : "resembling an eel", "borborygmus" : "a rumbling or gurgling noise in the intestines", "cybersquatting" : "registering well-known names as Internet domain names, in the hope of reselling them at a profit", "logomachy" : "an argument about words",  "winebibber" : "a heavy drinker" , "rumpot" : "a heavy drinker", "studmuffin" : "a sexually attractive, muscular man"}
weird_words

{'anguilliform': 'resembling an eel',
 'borborygmus': 'a rumbling or gurgling noise in the intestines',
 'cybersquatting': 'registering well-known names as Internet domain names, in the hope of reselling them at a profit',
 'logomachy': 'an argument about words',
 'winebibber': 'a heavy drinker',
 'rumpot': 'a heavy drinker',
 'studmuffin': 'a sexually attractive, muscular man'}

Now let's make a new dictionary:
<br>
<br>➡ `animal_videos = {"Super Cute animals": "https://www.youtube.com/watch?v=C9OMAX91oyw","Cute Dogs And Cats" : "https://www.youtube.com/watch?v=sNU5TPHjHOc", "Laughing at Dogs": "https://www.youtube.com/watch?v=5U_Tf5TIHL0" }`
<br> You can create a new key / value pair like this:
<br>➡ `dictionary[key] = value`
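
Here is a minimal sketch of that pattern. It uses a made-up dictionary called `opening_hours`, so `animal_videos` is left free for the exercise.

```python
# Hypothetical dictionary, used only to illustrate dictionary[key] = value
opening_hours = {"Monday": "9-17", "Tuesday": "9-17"}

# Assigning to a key that does not exist yet adds a new pair
opening_hours["Wednesday"] = "closed"

print(opening_hours["Wednesday"])   # closed
print(len(opening_hours))           # 3
```

Looking up a value uses the same square brackets, with the key in between.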

In [47]:
# make a new item called "The Cutest Animals" for the animal_videos dictionary and assign this link to it: https://www.youtube.com/watch?v=4OSvHZIEa9g


In [None]:
# what happens if you re-assign the "The Cutest Animals" key with something else?


## **Debugging**

By now you are probably getting some errors. Let's see how to tackle them.

![bugs](../img/bugs.png)

Most common errors:
<br>➡ `NameError` : Python does not know the name you used. Check for **typos**, and check that the cell defining the variable has been run
<br>➡ `SyntaxError` : the code is not written correctly. You may be **missing** brackets or quotes, or using the wrong ones
<br>➡ `AttributeError` : the method you are using does not exist for that type of variable
<br>...
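
To see these three errors in action, the sketch below triggers each of them on purpose. The `try` / `except` blocks and the `caught` list are additions for this example, so that the cell can show all three messages instead of stopping at the first one. The exact wording of the messages depends on your Python version.

```python
weird_word = "borborygmus"
caught = []  # collects the names of the errors we run into

# NameError: the name is misspelled, so Python does not know it
try:
    print(wierd_word)
except NameError as error:
    caught.append(type(error).__name__)
    print(f"NameError: {error}")

# AttributeError: .append() exists for lists, not for strings
try:
    weird_word.append("!")
except AttributeError as error:
    caught.append(type(error).__name__)
    print(f"AttributeError: {error}")

# SyntaxError: a closing bracket is missing.
# compile() is used here so the broken code can live inside a string.
try:
    compile('print("missing bracket"', "<example>", "exec")
except SyntaxError as error:
    caught.append(type(error).__name__)
    print(f"SyntaxError: {error}")

print(caught)
```

In a normal cell you would not wrap your code like this. Read the last line of the error message first: it names the error type and usually points to the problem.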

![debug](../img/debug.png)

## **Methods**

![methods](../img/methods.png)

**NOTE**
<br>➡ if a method **has a** `.` at the beginning, you put it **after** the variable, e.g. `variable.strip()`
<br>➡ if a method **has no** `.` at the beginning (strictly speaking, this is called a **function**), you put the variable **in between** the brackets, e.g. `len(variable)`

### **String methods**

<br>➡ `len()` : counts the characters in a string: `len(string)`

<br>➡ `.strip()` : removes whitespace before and after a string
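
The sketch below applies both to a made-up headline with extra spaces around it. Note where the variable goes in each case: between the brackets for `len()`, before the dot for `.strip()`.

```python
headline = "  Today in Focus  "   # two spaces before and two after

print(len(headline))              # 18 (the spaces are counted too)

clean_headline = headline.strip()
print(clean_headline)             # Today in Focus
print(len(clean_headline))        # 14
```

`.strip()` does not change `headline` itself. It gives back a new string, which is why the result is stored in `clean_headline`.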

### **Number methods**

<br>➡ `str()` : makes a string out of a number

Useful while scraping

In [None]:
"www.website.org/something?page=" + str(10)

### **List methods**

<br>➡ Lists also have a length: `len(list)`

<br>➡ `.append()` : adds a new element to the end of the list. Often used in `for` loops

<br>➡ `.join()` : joins the items of a list into a `string`. You call it on the separator: `", ".join(list)`

<br>➡ `.split()` : splits a string to create a `list`
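
The four list tools above can be tried out together, as in this sketch. The list contents are borrowed from the weird words, and the variable names are only examples.

```python
words = ["logomachy", "rumpot"]
print(len(words))              # 2

words.append("winebibber")     # adds the new item at the end
print(len(words))              # 3

# .join() is called on the separator, with the list between the brackets
sentence = ", ".join(words)
print(sentence)                # logomachy, rumpot, winebibber

# .split() does the opposite: it cuts a string into a list at the separator
words_again = sentence.split(", ")
print(words_again)
```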

# **Slicing**

![slicing](../img/bread.png)

<br>➡ use a `[]` with a number after a list or string, e.g. `list[4]`
<br>➡ *Important* : in Python, we **start counting at 0**

![hello-string](../img/slice.png)

What is happening here?

In [None]:
print(weird_words)
weird_words[2]

In [None]:
print(weird_words)
weird_words[3:5]
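
The sketch below walks through the same idea step by step. It assumes `weird_words` is the **list** from the Lists section, so it defines that list again; if `weird_words` still holds the dictionary made later, the two cells above will raise an error instead.

```python
# Assumption: the list version of weird_words, redefined here
weird_words = ["anguilliform", "borborygmus", "cybersquatting",
               "logomachy", "winebibber", "rumpot", "studmuffin"]

print(weird_words[0])       # anguilliform (counting starts at 0)
print(weird_words[2])       # cybersquatting (the third item)
print(weird_words[3:5])     # ['logomachy', 'winebibber'] (index 3 and 4; 5 is not included)
print(weird_words[-1])      # studmuffin (negative numbers count from the end)

# Slicing works on strings as well
print(weird_words[0][:4])   # angu
```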

# **Loops**

![loops](../img/loops.png)

<br>➡  a set of instructions that are **continually repeated**

```python
for item in [something, something_else]:
    print(item)
```

<br>➡ `item` is a placeholder name that stands for each element of the list in turn
<br>➡ `print()` is a function that just displays the value
<br>➡ in scraping, loops are used to, for example, open multiple URLs and extract data from them
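
As a sketch of that scraping use case, the loop below builds one URL per page number, using the page address from the example scraper. It only builds and prints the URLs and does not request any page. The names `base_url`, `urls` and `page_number` are chosen for this example.

```python
base_url = "https://www.theguardian.com/news/series/todayinfocus?page="
urls = []  # will hold one URL per page

for page_number in [1, 2, 3]:
    url = base_url + str(page_number)   # the number must become a string first
    urls.append(url)
    print(url)

print(len(urls))   # 3
```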


In [None]:
# 1. Let's make a list of weird sports and display the items one by one using a for loop.
# inspiration : wife carrying, bog snorkelling, toe wrestling, cheese rolling, football 😜
# 2. print "I think x is really weird" for each item on the weird sports list



➡ sometimes we can create collections with a built-in **function**
<br>➡ for example `range(start, end)`, which counts from `start` up to, but not including, `end`
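
A small sketch of `range()` with example numbers. Wrapping it in `list()` is just a way to look at all the numbers at once.

```python
numbers = list(range(1, 6))
print(numbers)   # [1, 2, 3, 4, 5] (6 is not included)

# range() is most often used directly in a for loop
for n in range(1, 4):
    print(f"page {n}")
```

This is why the example scraper uses `range(1,6)` to get pages 1 to 5.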

In [46]:
# Pretend it's later tonight: you are lying in bed after the party at the bar, but you still can't sleep!
# You start counting sheep

# Create a for loop that prints out "1 sheep", "2 sheep", ... until "100 sheep"

# **Questions ?**
I'd be happy if you [leave me some feedback](https://goo.gl/forms/OtuNECgexYSyJGjh1) for this session so I can make it better.