<a href="https://colab.research.google.com/github/soriarty-pt536/soriarty-pt536/blob/main/Day1_Practice_Introduction_to_Python_and_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Paradigms of programming:

Programming, in the abstract, is the activity of writing instructions that a computer executes on data stored in its main memory. This activity can be structured in multiple ways, and the different programming paradigms represent different styles of thinking about that structure.

<img src="https://adriancolyer.files.wordpress.com/2019/01/Programming-paradigms-Fig-2.png" width=85%>

[Source](https://blog.acolyer.org/2019/01/25/programming-paradigms-for-dummies-what-every-programmer-should-know/)

**Common programming paradigms include:**

- **Imperative** in which the programmer instructs the machine how to change its state,
	- **procedural** which groups instructions into procedures,
	- **object-oriented** which groups instructions together with the part of the state they operate on,
- **Declarative** in which the programmer merely declares properties of the desired result, but not how to compute it
	- **functional** in which the desired result is declared as the value of a series of function applications,
	- **logic** in which the desired result is declared as the answer to a question about a system of facts and rules,
	- **mathematical** in which the desired result is declared as the solution of an optimization problem

[Wikipedia](https://en.wikipedia.org/wiki/Programming_paradigm)
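
As a rough illustration of the difference between the imperative and the functional style listed above, the sketch below computes the same result in both ways. The task (summing the squares of the even numbers) and the names `numbers`, `total_imperative` and `total_functional` are made up for this example. The syntax is introduced later in this tutorial.

```python
numbers = [1, 2, 3, 4, 5, 6]  # made-up example data

# Imperative style: tell the machine step by step how to change its state
total_imperative = 0
for n in numbers:
    if n % 2 == 0:
        total_imperative += n * n

# Functional style: describe the result as a chain of function applications
total_functional = sum(map(lambda n: n * n, filter(lambda n: n % 2 == 0, numbers)))

print(total_imperative, total_functional)
```

Both variants give the same number; they differ only in how the computation is expressed.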

# Procedural programming

The data we are concerned with typically takes the form of **variables**: containers for values, stored at certain locations in memory and accessible through a human-readable _variable name_. We typically assign the names and values of variables ourselves.

Example:

`variable_name = 10`

Note that in this case the `=` operator stands for _let it be equal_ or _assign value to_. This is a different operation from comparing two values and _asking whether they are equal_ (in Python that would be `variable_name == 10`).

Procedural programming starts from the observation that we typically want to run a given set of actions on the data more than once.

More precisely:

"Procedural programming is a programming paradigm, derived from structured programming, based on the concept of the procedure call. Procedures, also known as routines, subroutines, or **functions**, simply contain a **series of computational steps to be carried out**. Any given procedure might be called at any point during a program's execution, including by other procedures or itself."

[Wikipedia](https://en.wikipedia.org/wiki/Procedural_programming)

**So a function is just a set of instructions (themselves often functions) to take some input, manipulate it and give back an output.**

A function has no notion of "state", so ideally it should behave the same way on the same input whenever it is called, irrespective of any change in context. In other words: whatever _else_ is in memory, the function has to have the same effect, and only on the variables (parameters) it was given. It is **not** good practice to make a function modify variables outside of its _"scope"_. ("Don't touch variables you were not given!")

(Sidenote: In functional programming a function can be considered equivalent to the mathematical notion of a function, that is, it "maps" or "transforms" the input data into the output data.)
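
To make the "don't touch variables you were not given" advice concrete, here is a minimal sketch contrasting a function that depends only on its parameters with one that also changes an outside variable. The names `counter`, `add_pure` and `add_and_count` are hypothetical and serve only as illustration.

```python
counter = 0  # a variable living outside of any function (hypothetical)

def add_pure(a, b):
    """Depends only on its parameters and changes nothing outside itself."""
    return a + b

def add_and_count(a, b):
    """Discouraged: also modifies a variable that it was not given."""
    global counter
    counter += 1
    return a + b

print(add_pure(2, 3))       # same result for the same input, every time
print(add_and_count(2, 3))  # same return value, but `counter` has changed
print(counter)
```

The second function is harder to reason about, because calling it has an effect that is not visible in its parameters or its return value.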


There are multiple layers of functions in programming languages: the language comes with its own set of built-in functions, we can define our own for re-use, and we can use **other people's functions contained in packages**.

In the case of Python the structure looks like this:

<img src="https://d2h0cx97tjks2p.cloudfront.net/blogs/wp-content/uploads/sites/2/2018/01/Python-Functions.jpg" width=65%>

## Basic structure of a function

Suppose I want to bundle together the execution of two actions on variables.

Let us see what the anatomy of such a function would look like:

```python
definition_keyword the_name_of_the_function(with_this, with_that):
    local_variable = 10
    do_something(with_this)
    do_something_else(with_that, local_variable)
    finish_and_give_back_results
```

In this description `definition_keyword` is a fixed keyword (`def` in Python) that marks the beginning of the function definition (together with `:` in Python, discussed below). Inside the parentheses are the variables / _parameters_ `with_this` and `with_that` on which the function should operate. (In Python the parentheses are mandatory both when defining and when calling a function, even if there are no parameters.)

The indented lines after the definition line, up to the final one, make up the _body of the function_.

`do_something` and `do_something_else` are some actions you can carry out, typically other system or user defined functions. 

Note that I am at liberty to define new variables inside a function and use them. These variables remain _"inside the scope of"_ the function: when the function finishes, _they get deleted_, so they are only accessible inside the function, temporarily.
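
A runnable counterpart of the skeleton above might look like the following sketch. The function `shout_and_repeat` and its parameters are invented for this example; the point is only to show a definition line, a local variable, two actions and a returned result.

```python
def shout_and_repeat(text, times):
    separator = " "                            # local variable, visible only inside the function
    loud = text.upper()                        # first action
    repeated = separator.join([loud] * times)  # second action, using the local variable
    return repeated                            # finish and give back the result

print(shout_and_repeat("hi", 3))

# The local variable no longer exists once the function has finished:
try:
    print(separator)
except NameError as err:
    print("Not accessible outside the function:", err)
```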

# Object oriented programming: Classes vs Object instances 


Classes are abstract containers of code and data that often represent real-life abstract entities / functionalities. ("TheStudent" is someone who can have this and that property and who can carry out this or that function, in the abstract.)

<img src="http://drive.google.com/uc?export=view&id=1RGTfY3TRE--541fBsCXgGd0qaVCBsSuV" width=65%>

As a rule, one does not execute (call) a function of a class ("TheStudent" in the abstract), but a function of an instance ("john", a particular student "instance"; that's why poor john is lowercase, as is customary for instances).

For this, one has to "instantiate" the class, that is, create a particular instance.

```python

john = TheStudent()

# Meaning: Let us create an instance of student and put him into the variable 'john'

# Please observe the usage of CamelCase - customary for classes!
# Please observe the () syntax, which we will discuss later (In this case it calls the __init__ function of the Class) 
```

After this we can call a function of it.

```python

john.do_homework(when="now")

# More on function calling syntax later.
```

**Objects typically have some "object variables" (like age, in the student case), that are there as "memory", so they often hold some kind of "state".**

```python
john.age = 13
```

This also implies that if I call a function of an object multiple times, it can behave differently, since it has an inner state that it can rely on. Objects have a kind of life cycle of their own.
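
The snippets above assume that a class called `TheStudent` already exists. A minimal sketch of what such a class could look like is given below. The attribute `homework_done` and the returned message are assumptions made for illustration, not part of the original example.

```python
class TheStudent:
    """Hypothetical sketch of the class used in the text."""

    def __init__(self, age=0):
        self.age = age          # object variable: part of the instance's state
        self.homework_done = 0  # assumed extra state, only for illustration

    def do_homework(self, when="now"):
        self.homework_done += 1  # the call changes the inner state
        return f"Homework number {self.homework_done} done {when}."

john = TheStudent()
john.age = 13
print(john.do_homework(when="now"))
print(john.do_homework(when="later"))  # same function, different result because the state changed
```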

# Python introduction

Python is a high-level, general purpose programming language, which is

+ interpreted (programs do not have to be compiled before running)
+ supports **multiple programming paradigms** (object-oriented, functional, procedural, imperative - so you will often see code with mixed elements from these)
+ dynamically typed (variables do not have a declared type, they can be bound to objects of any type during execution)

with a strong focus on code readability. In fact, Python's syntax is frequently compared to that of pseudocode (not looking like code at all, more like normal written text defining actions / procedures).
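
Dynamic typing can be seen in a few lines: the same name can be bound to objects of different types, one after the other. The names `value` and `seen_types` are arbitrary.

```python
value = 42
seen_types = [type(value).__name__]

value = "forty-two"                      # the same name now refers to a string
seen_types.append(type(value).__name__)

value = [4, 2]                           # ... and now to a list
seen_types.append(type(value).__name__)

print(seen_types)
```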

Python has two major version series, 2.x and 3.x, which are rather incompatible with each other. Python 2 reached its end of life in January 2020, but you may still come across code written for it. In this course we will use Python version 3.7.

## REPL - Read Eval Print Loop

The interpreted nature of Python allows it to be run interactively. 

You can try out an interactive Python environment under https://repl.it/repls/PotableAdorableBrains?lite=true.

In case you are using this tutorial online, you can in fact try it below in the code cell.

In [None]:
from IPython.display import HTML

HTML('<iframe height="400px" width="100%" src="https://repl.it/repls/PotableAdorableBrains?lite=true" scrolling="no" frameborder="no" allowtransparency="true" allowfullscreen="true" sandbox="allow-forms allow-pointer-lock allow-popups allow-same-origin allow-scripts allow-modals"></iframe>')

In this case a **Python interpreter** is running somewhere; I have terminal access to it, and it executes my commands interactively.

This is the basic concept behind the execution of notebooks (such as this).

## Basic syntax

In [None]:
# Comment lines start with a #

"""Strings standing alone in the code (so called docstrings)
can also be used as comments/documentation.""" # They don't affect the execution of the code

print(1) # statements do not end with a semicolon and are typically single line
print \
("Python") # if you must, you can use a backslash to continue the statement in the following line

a = 3 # assignment

mylist=["this","course","is","great!"] # list of elements
# Elements listed within brackets can span several lines
l= [1, 
    2,
    3]
print(l)
b = (123 +
     32 +
     44)
print(b)
print("Welcome",
     "to",
     "Python!")

# Multiple statements on a single line have to be separated by semicolons
x = 24; print(x) ; print(mylist)

1
Python
[1, 2, 3]
199
Welcome to Python!
24
['this', 'course', 'is', 'great!']


## Code blocks are indicated with indentation
The only really unusual (shocking?) aspect of Python's syntax is that it uses **indentation** to group statements together into blocks. Where other languages would use braces or "begin"--"end" pairs to indicate the beginning and end of a block, and maybe indent code for aesthetic reasons, Python relies exclusively on indentation levels, together with colons after block-introducing constructs, to indicate block structure.

The main rule: If two statements
+ are at the same indentation level, and
+ there are no statements between them with lower indentation level
then they belong to the same block.

Control structures and data types will be discussed later, so the following examples are only to provide an intuitive demonstration of Python blocks:

In [None]:
a = 1 # block 0 -- unindented top level
b = -1 # block 0 -- unindented top level

if a > 0:                            # block introducing constructs end with a colon
    print("a is positive")           # block 1 starts
    print("Let's examine b as well") # block 1
    if b < 0:                        # block 1 --  block introducing construct
        print("b is negative")       # block 2 
    else:                            # block 1 -- block introducing construct
        print("b is non-negative")   # block 3
    print("End of program")          # block 1

a is positive
Let's examine b as well
b is negative
End of program


Indentation can make a huge difference to a program. Compare

In [None]:
if  a < 0:
    print("a is negative")
    print("End of program")

with

In [None]:
if  a < 0:
    print("a is negative")
print("End of program")

End of program


As for the size of indentation or the use of tabs vs spaces, Python's only requirement is consistency, but it is customary to use 4 spaces per indentation level and we will stick to that.

## Basic data types

Python has arbitrary-precision integers and double-precision floating-point numbers, with the usual arithmetic operators (a 'complex' type is supported as well but we will not use it):

In [None]:
# Integers

a = 5; b = 2 # Variable assignment for any type of variable

print(type(a))
print(a + b) # Addition
print(2 - 5) # Subtraction
print(a * b) # Multiplication
print(a ** b) # Exponentiation
print(a // b) # Floor division 
print(a % b) # Modulus
print(a / 2) # True division -- the result is always a float in Python 3!
print(type(a / 2))

<class 'int'>
7
-3
10
25
2
1
2.5
<class 'float'>


In [None]:
# Floats
c = 2.5
print(type(c))
d = 3. # This is a float because of the decimal point!
# All operators listed above for integers are supported:
print(c + d)
print(d % c)
print(10 ** 100) # Arbitrary precision integer
print(10. ** 100) # Float

<class 'float'>
5.5
0.5
10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
1e+100


In [None]:
# Booleans are the values True and False

a = True; b = False
print(type(a))
print(a and b) # Logical and
print(a or b) # Logical or
print(not a) # Negation
print(a != b) # Xor with booleans (see next cell for other usage)

<class 'bool'>
False
True
False
True


In [None]:
# Comparisons return Boolean values

a = 3; b = 4
print(a == b) # General equality comparison, returns True if a and b have equal content
a = "a"; b = "b"
print(a != b) # True if a and b do not have equal content

# Arithmetic comparisons

a = 3; b = 4
print(a < b)
print(a >= b)

# String comparisons
a = "a"; b = "b"
print(a < b) # Lexicographic comparison

False
True
True
False
True


In [None]:
# Strings
a = "Artificial" # String literals can use double and
b = 'Intelligence' # single quotes as well
print(type(a))
print(len(a)) # The len() function returns their length
c = """
Strings can be multiline as well
when delimited by three quotes
"""
print(c)
s = " "
print(a + s + b) # Concatenation
print('%s %s %d' % (a, b, 2018.1)) # Traditional string formatting can be used
print(f"Human {b}") # Starting with Python 3.6 string interpolation is also supported in 'f' prefixed strings

<class 'str'>
10

Strings can be multiline as well
when delimited by three quotes

Artificial Intelligence
Artificial Intelligence 2018
Human Intelligence


In [None]:
# Shortcuts to apply an operator to a variable's value and assign the result
# to the same variable are supported for most operations
a = 1
a += 5
print(a)
a **= 2
print(a)
b = "A"
b += "I"
print(b)


6
36
AI


In [None]:
# Although not exactly a data type, here is the best place to mention None
# which is Python's nil/null/nothing value.
# None appears as an (implicit) return value in many cases

print(type(None))
print(None)
print(print(None)) # Functions with no explicit return value return None
l = [None, print(None)] # None is a first class citizen
print(l)

<class 'NoneType'>
None
None
None
None
[None, None]


In [None]:
# The data types we have discussed can be "converted" to each other 
# (where this makes sense) simply by calling their constructors with the value
# to be "converted":

print(bool(1))
print(str(2008))
print(int(3.5))

True
2008
3


**Fair warning**

+  '++' like operators to increment or decrement by 1 are not supported. 
+ **All listed basic data types are immutable.**
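
What immutability means in practice can be sketched as follows: an operation that seems to change a string or a number actually creates a new object and rebinds the name. The variable names are arbitrary.

```python
s = "python"
try:
    s[0] = "P"        # strings cannot be changed in place
except TypeError as err:
    print("TypeError:", err)

s = "P" + s[1:]       # instead, a new string is created and the name is rebound
print(s)

a = 1
before = id(a)        # identity of the object currently bound to `a`
a += 1                # creates a new int object and rebinds `a`
print(id(a) != before)
```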

## Containers

In [None]:
# Lists
l = [1, "a", True] # are mutable arrays that can hold any type of objects
print(type(l))
print(len(l)) # Their length can be queried
print(l[1]) # Elements can be accessed by their index. Indexing starts from 0.
print(l[-1]) # A negative index gives the position from the end. -1 is the index of the last element.
l = [0,1,2,3,4,5]
print(l[2:4]) # Slicing is also supported. The first index is inclusive, the second is exclusive.
print(l[3:-2])
print(l[3:]) # If no start/end index is given then the slice starts from the beginning/ends with the last element.
l2 = ["a", "b"]
print(l + l2) # Lists can be concatenated
l.append(True) # New elements can be appended to lists
print(l)
del l[0] # Elements can be deleted by referring to them by index
print(l)

<class 'list'>
3
a
True
[2, 3]
[3]
[3, 4, 5]
[0, 1, 2, 3, 4, 5, 'a', 'b']
[0, 1, 2, 3, 4, 5, True]
[1, 2, 3, 4, 5, True]


In [None]:
# Dictionaries
lang_year = {"Python": 1991, "Lisp": 1958, "Java": 1995} # are key-value stores
print(lang_year["Python"]) # Return the value associated with the key
print("C++" in lang_year) # Whether a key has an associated value
# lang_year["C++"] would raise an exception since "C++" is not among the keys
print(lang_year.get("C++", "key not in dict")) # Returns the value for the key if there is one,
print(lang_year.get("Lisp", "key not in dict")) # a default value otherwise
lang_year["C++"] = 1983 # Change the value for an already existing key or add a key with a value
print(lang_year.get("C++", "key not in dict")) # now the "C++" can be found among the keys 
del lang_year['Java'] # Remove an element from a dictionary
print(lang_year.get("Java", "key not in dict"))

1991
False
key not in dict
1958
1983
key not in dict


In [None]:
# Sets
cities = {"Rome", "Berlin", "Paris"} # are mutable unordered collections of distinct, hashable (typically immutable) objects
print(type(cities))
print("Rome" in cities) # Whether an object is in the set
cities.add("London") # Objects can be added
cities.add("Berlin") # Adding an already existing object: the set will contain Berlin once. 
cities.remove("Paris") # and removed
print(cities)

<class 'set'>
True
{'Berlin', 'Rome', 'London'}


In [None]:
# Tuples
t = ("a", "b", "c", "d")  # Are immutable ordered sequences. Since they are immutable, they can be
                          # keys in dictionaries and elements in sets (if their elements are hashable), unlike lists
print(type(t))
print(len(t))
print(t[0]) # Elements can be accessed in the same way as that of lists
print(t[1:-1])

<class 'tuple'>
4
a
('b', 'c')


### Comprehension

Python has a special syntax for defining collections based on other collections, the so-called comprehension, which can be used for defining lists, sets and dictionaries that are the results of mapping and/or filtering the elements of another collection/iterable.

In [None]:
# List comprehension

nums = [1, 2, 3, 4, 5, 6, 7, 8, 9]
squares_of_odds = [x * x for x in nums if x % 2 != 0]
print(squares_of_odds)

[1, 9, 25, 49, 81]
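
For comparison, the list comprehension above is roughly equivalent to the explicit loop below. The name `squares_loop` is used only for this illustration.

```python
nums = [1, 2, 3, 4, 5, 6, 7, 8, 9]

squares_loop = []
for x in nums:
    if x % 2 != 0:                  # filtering step
        squares_loop.append(x * x)  # mapping step

# Both approaches build the same list
print(squares_loop == [x * x for x in nums if x % 2 != 0])
```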


In [None]:
# Dictionary comprehension

nums_with_squares = {x: x*x for x in nums}
print(nums_with_squares)

{1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}


In [None]:
# Set comprehension 

print({x * x for x in nums}) # Order will be arbitrary!


{64, 1, 4, 36, 9, 16, 49, 81, 25}


### Generator expressions

Iterators are Python objects that sequentially return items, one at a time. Calling the 'next' function with an iterator argument returns the next item or raises the StopIteration exception if the iterator has been exhausted.

An especially easy way of creating iterators is provided by Python's generator expressions, which have the same syntax as list or set comprehensions but are enclosed in parentheses:

In [None]:
g = (x + 10 for x in [1,2])
type(g)

generator

In [None]:
next(g)

11

In [None]:
next(g)

12

In [None]:
# next(g) # Would raise StopIteration

Iterators are NOT containers: in general the generated elements don't even have to exist in memory at the same time. They can be converted into containers, of course:

In [None]:
g = (x + 10 for x in [1,2])
print(list(g)) # This generates a list by calling 'next' until g gets exhausted
list(g) # This will be empty because g has been exhausted

[11, 12]


[]

## Control structures

In [None]:
# Conditional -- the syntax is as can be expected, note "elif" though which is somewhat idiosyncratic.

lisp_year = lang_year["Lisp"]

if lisp_year < 1960:
    print("Lisp is really ancient.")
elif lisp_year <= 2000:   # some people count 2000 as part of the 21st century, but it is officially the last year of the 20th
    print("Lisp is from the 20th century.")
else:
    print("Lisp is recent.")

Lisp is really ancient.


In [None]:
# While

i = 0
while i < 5:
    print(i)
    i += 1

0
1
2
3
4


In [None]:
# Iteration/For loop
# One can iterate over the elements of containers with a for loop

# Lists

fruits = ["apple", "orange", "pear"]

for fruit in fruits:
    print(f"{fruit} is a fruit.")

apple is a fruit.
orange is a fruit.
pear is a fruit.


In [None]:
# For loops can iterate over any iterator, eg. over one created using 
# a generator expression:

fruits = (fruit + "s" for fruit in fruits) 
print(type(fruits))
for s in fruits:
    print(s)

<class 'generator'>
apples
oranges
pears


In [None]:
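# if the indices are also needed:

fruits = ["apple", "orange", "pear"] # re-create the list: the generator from the previous cell is exhausted

for idx, fruit in enumerate(fruits):
    print(f"Fruit {fruit} has index {idx}.")
    
print()

# if _only_ indices are needed, this is the idiomatic way:

for i, _ in enumerate(fruits):
    print(i)

Fruit apple has index 0.
Fruit orange has index 1.
Fruit pear has index 2.

0
1
2

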
In [None]:
# For works similarly for sets and tuples, but note that the iteration order for sets is arbitrary
# and can change
for city in cities:
    print(city)

Berlin
Rome
London


In [None]:
# A for loop over a dictionary iterates over the keys:

for lang in lang_year:
    print(lang)

Python
Lisp
C++


In [None]:
# If both the keys and values are needed then you should use the items method:

for lang, year in lang_year.items():
    print(f"The {lang} language is from year {year}.")

The Python language is from year 1991.
The Lisp language is from year 1958.
The C++ language is from year 1983.


## Functions

In [None]:
# Function calls have the syntax <fun_identifier>(<arg_1>,...,<arg_n>)

print("a", "b", "c")

# Functions can be called only with brackets around the argument list, even if there are no arguments.
# Without the argument list a function name always refers to the function itself:

from time import localtime # localtime can be called without arguments.

print(localtime()) # Print the result of calling the localtime function 
print(localtime) # Print localtime, the function

# Functions are "first class citizens" in Python: they can be arguments and return values of functions/methods 
# and elements of containers.

my_list = [localtime, localtime]
print(my_list)
print(my_list[0]())

a b c
time.struct_time(tm_year=2019, tm_mon=9, tm_mday=27, tm_hour=14, tm_min=15, tm_sec=12, tm_wday=4, tm_yday=270, tm_isdst=1)
<built-in function localtime>
[<built-in function localtime>, <built-in function localtime>]
time.struct_time(tm_year=2019, tm_mon=9, tm_mday=27, tm_hour=14, tm_min=15, tm_sec=12, tm_wday=4, tm_yday=270, tm_isdst=1)


In [None]:
# Functions are defined with the def keyword

def sign(x):
    if x < 0:
        return -1
    elif x == 0:
        return 0
    else:
        return 1

print(sign(-5))

-1


In [None]:
# In the absence of an explicit 'return' functions return None

def hello(x): 
    print(f"Hello {x}!") 

hello("Guido") 
print(hello("Python")) 

Hello Guido!
Hello Python!
None


In [None]:
# Arguments can have a default value, which makes them optional in calls. Arguments can also be passed by name (as named/keyword arguments).

def hello_defaults(name, greeting="Hello", ending="!"):
    print(f"{greeting} {name}{ending}")
    
hello_defaults("Python")
hello_defaults("Lisp", "Good morning")
hello_defaults("Java", ending="?", greeting="Good day") # Named args can be in any order

Hello Python!
Good morning Lisp!
Good day Java?


In [None]:
# To define functions with a variable number of arguments
# the last arguments in the definition can start with one or two asterisks.
# These will contain the values of all extra positional arguments that are passed 
# but not listed explicitly in the function definition. 

def hello_varargs(name, *names):# the value of the _single_ asterisked argument will be a tuple 
    print(f"Hello {name}!")     # containing the values of the _additional_ positional arguments passed in the call
    for name in names:
        print(f"Hello to {name} as well!")
        
hello_varargs("Python", "Java", "Lisp")

Hello Python!
Hello to Java as well!
Hello to Lisp as well!


In [None]:
# An argument starting with a double asterisk collects all extra named/keyword arguments with their values
# in a dictionary

def hello_kwargs(name, greeting="Hello", **attribs):
    print(f"{greeting} {name},")
    for attr_name, attr_value in attribs.items():
        print(f"your {attr_name} is {attr_value}")
        
hello_kwargs("Python", greeting="Good day", developer="Guido van Rossum", year="1995", popularity="skyrocketing")

Good day Python,
your developer is Guido van Rossum
your year is 1995
your popularity is skyrocketing


## Lambda

Python also supports lambda expressions, which are a good starting point for learning the functional programming style. With a lambda expression you can create a function without a name and without a `def` header. You can read more about it <a href="https://realpython.com/python-lambda/">here</a>.




In [None]:
a=[1,2,3,5]

# Creating a list with each number multiplied by 2 takes 3 lines with a named function:
def my_function(x):
    return x * 2
list(map(my_function,a)) 


# with Lambda just one line:
list(map(lambda x: x * 2,a))


    

[2, 4, 6, 10]

## Generator functions

Generator functions are special functions that use the yield keyword instead of return and, when called, return an iterator. At the first 'next' call the iterator starts running the body of the generator function until it reaches a 'yield', returns the yielded object and suspends the run of the generator function until the following 'next' call, when it resumes running until it reaches a 'yield' again, and so on. If 'next' is called for the iterator and the generator function's execution ends without reaching a 'yield', then a 'StopIteration' exception is raised.

In [None]:
def two_powers():
    """Generator function yielding the powers of 2 under 1000.
    """
    p = 1
    while p < 1000:
        yield p
        p = p * 2

g = two_powers()

for p in g:
    print(p)

1
2
4
8
16
32
64
128
256
512


## Objects and classes

All data in Python are represented by objects: all instances of the primitive data types, all containers etc.
are objects with associated classes.

In [None]:
# Objects have associated methods and attributes
# Methods can be called with the dot syntax: <object_identifier>.<method_name>(arg_1,...,arg_n)
# All function calling conventions (named arguments etc.) apply to methods as well

a = "python"
a_capitalized = a.upper()
print(a_capitalized)

# Similarly to functions, without brackets the method name refers to the method itself:
print(a.upper)

# Attributes, in contrast, are accessed without brackets:

lt = localtime()
print(lt)
print(lt.tm_year)

PYTHON
<built-in method upper of str object at 0x10c08c5e0>
time.struct_time(tm_year=2019, tm_mon=9, tm_mday=27, tm_hour=14, tm_min=15, tm_sec=12, tm_wday=4, tm_yday=270, tm_isdst=1)
2019


In [None]:
# Objects are always instances of a class, and new classes can easily be defined
# from scratch or by inheriting from an already existing class

class Point: # classes to inherit from could be listed between brackets
             # after the class name
    """A class representing a point on a plane.
    """
    
    # method definition syntax is the same as that of functions,
    # but when called the first argument -- always "self" by convention -- is bound to
    # the object itself.
    
    def __init__(self, x, y): # The method called __init__ initializes each new instance (it acts as the constructor)
        """Initialize a point with x, y coordinates.
        """
        self.x = x  # we are setting the x and y attributes
        self.y = y  # using the constructor arguments
            
    def distance(self, other):
        """Return the distance between two points.
        """
        return ((self.x - other.x) ** 2 + (self.y - other.y) ** 2) ** 0.5
        
p1 = Point(0, 0) # The class's name with arguments calls the constructor
print(p1)
p2 = Point(3, 4)
print(p1.distance(p2))
p1.x = 3; p1.y = 3
print(p1.distance(p2))

<__main__.Point object at 0x10e1c9400>
5.0
1.0
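
The comments in the cell above mention inheritance. A minimal sketch of it might look like the following, where `Point3D` is a hypothetical subclass and `Point` is repeated in shortened form so that the example runs on its own.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def distance(self, other):
        return ((self.x - other.x) ** 2 + (self.y - other.y) ** 2) ** 0.5


class Point3D(Point):  # the parent class is listed in parentheses
    def __init__(self, x, y, z):
        super().__init__(x, y)  # reuse the parent's initializer
        self.z = z

    def distance(self, other):  # override the inherited method
        planar = super().distance(other)
        return (planar ** 2 + (self.z - other.z) ** 2) ** 0.5


a = Point3D(0, 0, 0)
b = Point3D(3, 4, 12)
print(isinstance(a, Point))  # an instance of the subclass is also an instance of the parent
print(a.distance(b))
```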


## Importing

In [None]:
# Classes, functions etc. defined in other modules have to be imported before they can be used

from math import sin, sqrt # individual functions, classes etc can be imported from modules
print(sqrt(25)) # the identifiers imported this way become available in the current namespace
from math import * # This imports the whole content of the module into the current namespace -- not recommended

# Since modules are frequently organized into packages (and packages again into larger ones),
# the from syntax can be used to import a package or module as a whole, e.g.
from sklearn import cluster
print(cluster.KMeans) # This makes the content of the whole package available

import random # This makes available the whole content of the module
print(random.random()) # but the module's name (path) has to be used before the imported identifiers

import random as r # Makes available the whole content of the module, they can be accessed with the specified prefix
print(r.random())

5.0
<class 'sklearn.cluster.k_means_.KMeans'>
0.01876545042221356
0.6059466160613729


# Pandas (Python Data Analysis Library)

## Data frame basics

[Pandas](https://pandas.pydata.org/) is an open source Python data analysis library. The central Pandas abstraction is that of a [data frame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html): these are spreadsheet-like objects in which columns and rows are labeled with Python objects, e.g. strings or numbers. The list of row labels is called the index of the data frame.

In [None]:
import pandas as pd # Pandas is by convention imported as pd

# Data frames can be created in several ways, and we will see methods to create 
# them from external data sets later, but for now we create them simply from lists:

df = pd.DataFrame([["Python","multi", "Guido van Rossum", 1995, 3],
                    ["Lisp","multi", "John McCarthy", 1958, 33],
                    ["C++", "multi", "Bjarne Stroustrup", 1985, 4],
                    ["Java", "multi", "James Gosling", 1996, 1],
                    ["Haskell","functional", "Lennart Augustsson", 1990, 40],
                    ["Prolog", "logic", "Alain Colmerauer", 1972,  36]], 
                  columns=["name", "paradigm", "creator", "year", "popularity_rank"])

df # data frames are nicely visualised in Jupyter without printing

|  | name | paradigm | creator | year | popularity_rank |
|---|---|---|---|---|---|
| 0 | Python | multi | Guido van Rossum | 1995 | 3 |
| 1 | Lisp | multi | John McCarthy | 1958 | 33 |
| 2 | C++ | multi | Bjarne Stroustrup | 1985 | 4 |
| 3 | Java | multi | James Gosling | 1996 | 1 |
| 4 | Haskell | functional | Lennart Augustsson | 1990 | 40 |
| 5 | Prolog | logic | Alain Colmerauer | 1972 | 36 |


The first and last $n$ rows (the default is 5) of a data frame can be quickly inspected by the [head](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) and [tail](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html) methods:

In [None]:
df.head(2)

|  | name | paradigm | creator | year | popularity_rank |
|---|---|---|---|---|---|
| 0 | Python | multi | Guido van Rossum | 1995 | 3 |
| 1 | Lisp | multi | John McCarthy | 1958 | 33 |


Note that we haven't specified an index, so it is simply a numbering (starting, as usual in Python, from 0). If we use the language names as indices we get

In [None]:
df = pd.DataFrame([["multi", "Guido van Rossum", 1995, 3],
                    ["multi", "John McCarthy", 1958, 33],
                    ["multi", "Bjarne Stroustrup", 1985, 4],
                    ["multi", "James Gosling", 1996, 1],
                    ["functional", "Lennart Augustsson", 1990, 40],
                    ["logic", "Alain Colmerauer", 1972,  36]], 
                  columns=["paradigm", "creator", "year", "popularity_rank"], 
                  index=["Python", "Lisp", "C++", "Java", "Haskell", "Prolog"])

df

|  | paradigm | creator | year | popularity_rank |
|---|---|---|---|---|
| Python | multi | Guido van Rossum | 1995 | 3 |
| Lisp | multi | John McCarthy | 1958 | 33 |
| C++ | multi | Bjarne Stroustrup | 1985 | 4 |
| Java | multi | James Gosling | 1996 | 1 |
| Haskell | functional | Lennart Augustsson | 1990 | 40 |
| Prolog | logic | Alain Colmerauer | 1972 | 36 |


Columns can be accessed by label:

In [None]:
df["creator"]

Python       Guido van Rossum
Lisp            John McCarthy
C++         Bjarne Stroustrup
Java            James Gosling
Haskell    Lennart Augustsson
Prolog       Alain Colmerauer
Name: creator, dtype: object

In [None]:
# With appropriate column labels (strings that are valid Python identifiers and do not clash with data frame attributes) a dotted syntax also works:
df.year

Python     1995
Lisp       1958
C++        1985
Java       1996
Haskell    1990
Prolog     1972
Name: year, dtype: int64

In [None]:
# Rows can be accessed by the row index using loc:
df.loc["Python"]

paradigm                      multi
creator            Guido van Rossum
year                           1995
popularity_rank                   3
Name: Python, dtype: object

In [None]:
# loc can also be used to access 2-dimensional ranges using Python's slicing syntax extended to
# two dimensions. Unlike ordinary Python slices, a label slice INCLUDES the closing label:
df.loc["Lisp":"Java", "creator":"popularity_rank"]

Unnamed: 0,creator,year,popularity_rank
Lisp,John McCarthy,1958,33
C++,Bjarne Stroustrup,1985,4
Java,James Gosling,1996,1


In [None]:
# iloc provides integer-position-based access to data frame ranges
# (as with ordinary Python slices, the end position is excluded)
df.iloc[:3, 1:3]

Unnamed: 0,creator,year
Python,Guido van Rossum,1995
Lisp,John McCarthy,1958
C++,Bjarne Stroustrup,1985
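
The difference between the two slicing conventions is easy to miss, so here is a minimal side-by-side sketch. The four-row frame is a shortened copy of the one above, rebuilt only so the snippet runs on its own.

```python
import pandas as pd

# Shortened copy of the tutorial data frame.
df = pd.DataFrame(
    {
        "creator": ["Guido van Rossum", "John McCarthy",
                    "Bjarne Stroustrup", "James Gosling"],
        "year": [1995, 1958, 1985, 1996],
    },
    index=["Python", "Lisp", "C++", "Java"],
)

by_label = df.loc["Lisp":"Java"]   # label slice: both endpoints are included
by_position = df.iloc[1:3]         # positional slice: position 3 is excluded

print(list(by_label.index))        # ['Lisp', 'C++', 'Java']
print(list(by_position.index))     # ['Lisp', 'C++']
```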


In [None]:
# Individual cells can also be accessed by loc and iloc:
print(df.loc["Python", "year"])
print(df.iloc[0,0])

1995
multi


In [None]:
# Column and index labels can be renamed (inplace=True modifies df itself instead of returning a new frame)

df.rename(columns={"popularity_rank":"tiobe_idx"}, index={"C++": "Cpp"}, inplace=True)
df

Unnamed: 0,paradigm,creator,year,tiobe_idx
Python,multi,Guido van Rossum,1995,3
Lisp,multi,John McCarthy,1958,33
Cpp,multi,Bjarne Stroustrup,1985,4
Java,multi,James Gosling,1996,1
Haskell,functional,Lennart Augustsson,1990,40
Prolog,logic,Alain Colmerauer,1972,36
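
If modifying the frame in place is not wanted, `rename` can also be used without `inplace=True`, in which case it returns a new data frame and leaves the original alone. The sketch below shows this on a two-row toy frame (made up here for brevity).

```python
import pandas as pd

# Two-row toy frame, just enough to show the renaming.
df = pd.DataFrame({"popularity_rank": [3, 4]}, index=["Python", "C++"])

# Without inplace=True a renamed copy is returned.
renamed = df.rename(columns={"popularity_rank": "tiobe_idx"},
                    index={"C++": "Cpp"})

print(list(df.columns), list(df.index))            # original labels are unchanged
print(list(renamed.columns), list(renamed.index))  # new labels
```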


## Descriptives

Pandas can produce a summary of descriptive statistics for a data frame (the summary is itself a data frame) with the [describe method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html):

In [None]:
df.describe()

Unnamed: 0,year,tiobe_idx
count,6.0,6.0
mean,1982.666667,19.5
std,14.90861,18.598387
min,1958.0,1.0
25%,1975.25,3.25
50%,1987.5,18.5
75%,1993.75,35.25
max,1996.0,40.0
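
To make the rows of the summary less of a black box, the following sketch recomputes a few of them by hand for the `year` column. It relies on the Pandas defaults as I understand them: `std` is the sample standard deviation (`ddof=1`) and the percentiles use linear interpolation.

```python
import pandas as pd

# The 'year' column of the tutorial data frame.
year = pd.Series([1995, 1958, 1985, 1996, 1990, 1972], name="year")

summary = year.describe()

# Each entry of the summary matches the corresponding Series method.
print(summary["mean"], year.mean())
print(summary["std"], year.std(ddof=1))     # sample standard deviation
print(summary["25%"], year.quantile(0.25))  # first quartile
print(summary["50%"], year.median())        # median
```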


In [None]:
# If there are numerical columns then only those columns' statistics are included by default --
# the include parameter can be used to produce all:
df.describe(include="all")

Unnamed: 0,paradigm,creator,year,tiobe_idx
count,6,6,6.0,6.0
unique,3,6,,
top,multi,John McCarthy,,
freq,4,1,,
mean,,,1982.666667,19.5
std,,,14.90861,18.598387
min,,,1958.0,1.0
25%,,,1975.25,3.25
50%,,,1987.5,18.5
75%,,,1993.75,35.25
max,,,1996.0,40.0


## Cleaning the data

### Missing data
Many data sets have missing values in some of their examples. In Pandas these are typically represented by NaN ("not a number") values. One way of dealing with this problem is to simply delete all problematic cases. In Pandas this can be done with the [dropna](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) method of data frames, which, among other things, can delete all rows with missing values.

In [None]:
df.describe(include="all")

Unnamed: 0,paradigm,creator,year,tiobe_idx
count,6,6,6.0,6.0
unique,3,6,,
top,multi,John McCarthy,,
freq,4,1,,
mean,,,1982.666667,19.5
std,,,14.90861,18.598387
min,,,1958.0,1.0
25%,,,1975.25,3.25
50%,,,1987.5,18.5
75%,,,1993.75,35.25
max,,,1996.0,40.0


In [None]:
df.describe(include="all").dropna()



Unnamed: 0,paradigm,creator,year,tiobe_idx
count,6,6,6.0,6.0


Since removing all cases with missing data can distort the data set and introduce bias, methods have been developed to __impute__ missing values, and keep the problematic examples. Trivial approaches include imputing the mean or the median for numerical columns but far more sophisticated approaches exist.

**For more detailed methods see [here](https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4).**
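
The trivial imputation strategies mentioned above can be sketched with the `fillna` method. The frame below is a made-up variant of the tutorial data in which one `year` value is missing; it is only an illustration, not a recommendation of mean or median imputation for real data.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the tutorial data with one missing year.
df = pd.DataFrame(
    {"year": [1995.0, np.nan, 1985.0, 1996.0], "tiobe_idx": [3, 33, 4, 1]},
    index=["Python", "Lisp", "C++", "Java"],
)

dropped = df.dropna()  # deletes the Lisp row entirely

# Impute instead of deleting: fill the gap with the column mean or median.
mean_filled = df.fillna({"year": df["year"].mean()})
median_filled = df.fillna({"year": df["year"].median()})

print(dropped)
print(mean_filled.loc["Lisp", "year"])    # 1992.0
print(median_filled.loc["Lisp", "year"])  # 1995.0
```

Note that `mean` and `median` skip NaN values by default, so the fill values are computed from the three known years only.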

### Label encoding
In certain situations it is important to encode categorical data described by textual labels into numerical values -- this can be useful both for efficiency of representation and for certain learning algorithms.

Most frequently the labels will initially be represented by strings in Pandas, so our task is to encode these textual labels into numbers.

In [None]:
# We try to encode the 'paradigm' labels
df.paradigm.value_counts() # How often each distinct value occurs in the column

multi         4
functional    1
logic         1
Name: paradigm, dtype: int64

Two of the simplest strategies to solve the problem are to
+ map the $k$ category values to numbers $0,\dots,k-1$, or
+ use a so-called **one-hot (aka dummy) encoding** and introduce a separate binary feature/column for each category.

Pandas has a dedicated categorical data type for this kind of data, so the first strategy can be implemented by converting the column into a category column:

In [None]:
df.insert(1, "paradigm_cat", df.paradigm.astype("category"))
df.paradigm_cat

Python          multi
Lisp            multi
Cpp             multi
Java            multi
Haskell    functional
Prolog          logic
Name: paradigm_cat, dtype: category
Categories (3, object): [functional, logic, multi]

Internally, Pandas automatically maintains a list of numeric codes for the categories, and a version of the column which is encoded with these codes is also available:

In [None]:
df.paradigm_cat.cat.codes

Python     2
Lisp       2
Cpp        2
Java       2
Haskell    0
Prolog     1
dtype: int8

We can simply add these codes as a new column:

In [None]:
df.insert(2, "paradigm_code", df.paradigm_cat.cat.codes)
df

Unnamed: 0,paradigm,paradigm_cat,paradigm_code,creator,year,tiobe_idx
Python,multi,multi,2,Guido van Rossum,1995,3
Lisp,multi,multi,2,John McCarthy,1958,33
Cpp,multi,multi,2,Bjarne Stroustrup,1985,4
Java,multi,multi,2,James Gosling,1996,1
Haskell,functional,functional,0,Lennart Augustsson,1990,40
Prolog,logic,logic,1,Alain Colmerauer,1972,36
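
Since the codes are only meaningful together with the category list, it can be handy to keep the mapping around. The sketch below (on a shortened copy of the `paradigm` column) builds a code-to-label dictionary from `cat.categories` and uses it to decode the numbers again; the name `code_to_label` is just an illustrative choice.

```python
import pandas as pd

# Shortened copy of the 'paradigm' column, converted to a categorical.
paradigm = pd.Series(
    ["multi", "multi", "functional", "logic"],
    index=["Python", "Lisp", "Haskell", "Prolog"],
).astype("category")

codes = paradigm.cat.codes            # integer code for every row
categories = paradigm.cat.categories  # labels; position i belongs to code i

# Dictionary from numeric code back to the textual label.
code_to_label = dict(enumerate(categories))
decoded = codes.map(code_to_label)

print(code_to_label)  # {0: 'functional', 1: 'logic', 2: 'multi'}
print(list(decoded))
```

The codes follow the sorted order of the labels, so they carry no meaning beyond identifying the category.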


One-hot encoding can be achieved with the [get_dummies](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) function. By default it one-hot encodes all columns of `category` or `object` dtype, but it accepts a `columns` argument that explicitly lists the columns to be encoded.

In [None]:
df_one_hot = pd.get_dummies(df, columns=["paradigm_cat"])
df_one_hot

Unnamed: 0,paradigm,paradigm_code,creator,year,tiobe_idx,paradigm_cat_functional,paradigm_cat_logic,paradigm_cat_multi
Python,multi,2,Guido van Rossum,1995,3,0,0,1
Lisp,multi,2,John McCarthy,1958,33,0,0,1
Cpp,multi,2,Bjarne Stroustrup,1985,4,0,0,1
Java,multi,2,James Gosling,1996,1,0,0,1
Haskell,functional,0,Lennart Augustsson,1990,40,1,0,0
Prolog,logic,1,Alain Colmerauer,1972,36,0,1,0
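
One caveat when reproducing the table above: to my knowledge, recent Pandas versions (2.0 and later) return boolean `True`/`False` dummy columns by default rather than 0/1 integers. A minimal sketch that requests integer columns explicitly through the `dtype` argument of `get_dummies`, on a three-row toy frame:

```python
import pandas as pd

# Three-row toy frame with one categorical text column.
df = pd.DataFrame(
    {"paradigm": ["multi", "functional", "logic"]},
    index=["Python", "Haskell", "Prolog"],
)

# dtype=int asks for 0/1 integer columns regardless of the version default.
one_hot = pd.get_dummies(df, columns=["paradigm"], dtype=int)

print(one_hot)
```

Every row has exactly one 1 among the dummy columns, which is where the name "one-hot" comes from.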


### Normalization

For certain models the scale of the input variables is of paramount importance, so it is generally considered good practice to transform the inputs to zero mean and unit variance. The simplest approach is to use scikit-learn's [standard scaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), but more methods exist. See [here](https://en.wikipedia.org/wiki/Feature_scaling) for details.
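
To show what standardization does to the numbers, here is a hand-written sketch in plain Pandas; it is not the scikit-learn scaler itself. It subtracts each column's mean and divides by the population standard deviation (`ddof=0`), which, as far as I know, is also the convention `StandardScaler` follows.

```python
import pandas as pd

# Numerical columns of the tutorial data frame.
df = pd.DataFrame(
    {"year": [1995, 1958, 1985, 1996, 1990, 1972],
     "tiobe_idx": [3, 33, 4, 1, 40, 36]},
    index=["Python", "Lisp", "Cpp", "Java", "Haskell", "Prolog"],
)

# Column-wise z-score: zero mean and unit (population) variance.
standardized = (df - df.mean()) / df.std(ddof=0)

print(standardized.round(3))
print(standardized.mean().round(6))       # approximately 0 for every column
print(standardized.std(ddof=0).round(6))  # approximately 1 for every column
```

In a real modelling workflow the mean and standard deviation should be computed on the training data only and then reused for any new data, which is what the fit/transform split of the scikit-learn scaler is for.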