# Data Science for Manufacturing - Workshop 2-1

##  Objectives


- Lists, tuples, dictionaries, functions and loops

- Explain what a library is and what libraries are used for.

- Pandas and Numpy: when and how to use them

- Import a Python library and use the functions it contains.

- Read tabular data from a file into a program.

- Indexing and slicing

- Investigating data


## Python essentials 

### Lists

Lists are a common data structure to hold an ordered sequence of elements. Each element can be accessed by an index. Note that Python indexes start with 0 instead of 1:
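As a minimal sketch of indexing (the list name `numbers` and its values are assumptions chosen for illustration):

```python
# A hypothetical list of three values
numbers = [1, 2, 3]

# Index 0 is the first element, index 2 is the third
first = numbers[0]
third = numbers[2]
print(first, third)
```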



To add elements to the end of a list, we can use the `append` method. Methods are a way to interact with an object (a list, for example). We can invoke a method using a dot (`.`) followed by the method name and a list of arguments in parentheses. Let’s look at an example using `append`:
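A short sketch of `append`, again using a made-up list called `numbers`:

```python
numbers = [1, 2, 3]

# Call the append method on the list object to add one element at the end
numbers.append(4)
print(numbers)
```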

A list stores many values in a single structure.

- Doing calculations with a hundred variables called `data_point_001`, `data_point_002`, etc., would be at least as slow as doing them by hand.
- Use a list to store many values together.
    - Contained within square brackets `[...]`.
    - Values separated by commas `,`.
- Use `len` to find out how many values are in a list.
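The points above can be sketched as follows; the list name `thicknesses` and its values are hypothetical:

```python
# Many values stored together in one list instead of many separate variables
thicknesses = [2.1, 2.3, 1.9, 2.0]

# len reports how many values the list holds
count = len(thicknesses)
print(count)
```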

### Tuples

A tuple is similar to a list in that it’s an ordered sequence of elements. However, tuples cannot be changed once created (they are “immutable”). Tuples are created by placing comma-separated values inside parentheses `()`.
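A minimal sketch that creates one of each; the names `a_list` and `a_tuple` match the questions that follow, while the values are assumptions:

```python
# A list uses square brackets, a tuple uses parentheses
a_list = [1, 2, 3]
a_tuple = (1, 2, 3)

# Reading an element by index works the same way for both
print(a_list[0], a_tuple[0])
```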


   - What happens when you execute `a_list[1] = 5`?
   - What happens when you execute `a_tuple[2] = 5`?
   - What does `type(a_tuple)` tell you about a_tuple?


### Dictionaries

A dictionary is a container that holds pairs of objects - keys and values.

Dictionaries work a lot like lists - except that you index them with keys. You can think about a key as a name or unique identifier for the value it corresponds to.

To add an item to the dictionary we assign a value to a new key:
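A sketch of creating a dictionary and adding an item; the name `rev` and the key `second` come from the exercise that follows, and the remaining keys and values are assumptions:

```python
# A dictionary maps keys to values
rev = {"first": "one", "second": "two"}

# Look a value up by its key
print(rev["first"])

# Assigning to a key that does not exist yet adds a new item
rev["third"] = "three"
print(rev)
```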


**Changing dictionaries**

- First, print the value of the `rev` dictionary to the screen.
- Reassign the value that corresponds to the key `second` so that it no longer reads `"two"` but instead `2`.
- Print the value of `rev` to the screen again to see if the value has changed.

### Functions

Functions are used when a section of code needs to be repeated at various different points in a program. It saves you re-writing it all. In reality you rarely need to repeat the exact same code. Usually there will be some variation in variable values needed. Because of this, when you create a function you are allowed to specify a set of parameters which represent variables in the function.

In our use of the print function, we have provided whatever we want to print, as a parameter. Typically whenever we use the print function, we pass a different parameter value.

The ability to specify parameters makes functions very flexible. Defining a section of code as a function in Python is done using the `def` keyword. For example, a function that takes two arguments and returns their sum can be defined as:

![def%20keyword.png](attachment:def%20keyword.png)
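In code, such a function might look like this (the name `add_function` and the parameter names are assumptions):

```python
def add_function(a, b):
    """Return the sum of the two arguments."""
    result = a + b
    return result


# Call the function with two argument values
total = add_function(20, 22)
print(total)
```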

### Conditionals

Conditionals are similar in structure to `for` loops:
* The first line opens with `if` and ends with a colon.
* The body, containing one or more statements, is indented (usually by 4 spaces).
    
Use `if` statements to control whether or not a block of code is executed. 

Conditionals are often used inside loops.

Use `else` to execute a block of code when an `if` condition is not true.

Use `elif` to specify additional tests.

Order is important: conditions are tested in order, and only the block for the first true condition is executed.
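A small sketch that puts `if`, `elif` and `else` inside a loop; the diameters and tolerance limits are made-up values:

```python
diameters = [9.8, 10.0, 10.3]  # hypothetical measurements in mm
labels = []

for d in diameters:
    if d < 9.9:
        labels.append("undersized")
    elif d > 10.1:
        labels.append("oversized")
    else:
        # Reached only when neither test above was true
        labels.append("in tolerance")

print(labels)
```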

### Loops

A `for` loop can be used to access the elements in a list or other Python data structure one at a time:
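A minimal two-line sketch (the variable name `num` and the values are assumptions):

```python
for num in [1, 2, 3]:  # num takes each value in turn
    print(num)         # the indented body runs once per element
```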

Indentation is very important in Python. Note that the second line in the example above is indented.

It is really common to want to do something lots of times - for example, to change every item in a list, or to print a sequence of numbers. This is often done with a `for` loop. In the loop you have:

   - a list of things to run the loop on,
   - a variable, which will be set to each of the things in the list
   - a block of code, which will be carried out for each thing

It looks like this:

![Flowchart%20of%20for%20loop.png](attachment:Flowchart%20of%20for%20loop.png)

Using `for` loops with dictionaries is a little more complicated. We can do this in two ways:
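Two common approaches are sketched here, on the assumption that these are the two ways meant; the dictionary contents are illustrative:

```python
rev = {"first": "one", "second": "two"}

# Way 1: loop over key/value pairs with the items method
for key, value in rev.items():
    print(key, "->", value)

# Way 2: loop over the keys and look each value up
for key in rev.keys():
    print(key, "->", rev[key])
```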

You can also loop over a range of numbers using `range()`. This is particularly useful if you need to do something a fixed number of times.
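For instance, a sketch that repeats a calculation five times (the list name `squares` is an assumption):

```python
squares = []

# range(5) produces 0, 1, 2, 3, 4
for i in range(5):
    squares.append(i * i)

print(squares)
```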

You can combine for loops with if and else:
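A short sketch that sorts numbers into two hypothetical lists, `evens` and `odds`:

```python
evens = []
odds = []

for n in range(1, 6):
    if n % 2 == 0:
        evens.append(n)  # remainder 0 means n is even
    else:
        odds.append(n)

print(evens, odds)
```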

You can also put for loops inside each other - for instance to go through every square in a 3x3 grid:
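A sketch of the 3x3 grid case, collecting each (row, column) pair in a list named `cells` (an assumed name):

```python
cells = []

for row in range(3):
    # The inner loop runs in full for every value of row
    for col in range(3):
        cells.append((row, col))

print(len(cells))
```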

There is also the `while` loop. The `while` loop executes a block of statements repeatedly for as long as a given condition remains true, and stops as soon as the condition becomes false.

![Flowchart%20of%20for%20loop%281%29.png](attachment:Flowchart%20of%20for%20loop%281%29.png)
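A minimal `while` sketch with an assumed counter variable:

```python
count = 0

# The condition is checked before every pass through the body
while count < 3:
    print(count)
    count += 1  # without this update the loop would never end
```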

## Libraries

[Common libraries cheatsheet](https://www.python-graph-gallery.com/cheat-sheets/)

[Pandas cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

The term "library" refers to a code library: a file, or set of files, usually containing functions or precompiled code that a program can call for specific, well-defined operations.

- A library is a collection of files (called modules) that contains functions for use by other programs.
  - May also contain data values (e.g., numerical constants) and other things.
  - Library’s contents are supposed to be related, but there’s no way to enforce that.
- The Python standard library is an extensive suite of modules that comes with Python itself.
- Many additional libraries are available from PyPI (the Python Package Index).
- We will see later how to write new libraries.

Libraries provide additional functionality to the basic Python package, much like a new piece of equipment adds functionality to a lab space. Just like in the lab, importing too many libraries can sometimes complicate and slow down your programs - so we only import what we need for each program.

Once we’ve imported the library, we can ask the library to read our data file for us.

<span style="color:red">**A program must import a library module before using it**</span>
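As a small sketch, using the standard library module `math` as a stand-in for any library:

```python
# The import must come before any use of the module
import math

print(math.pi)
print(math.cos(math.pi))
```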

#### Pandas

What it does: Provides access to efficient data structures for structured and time-series data. Pandas is a widely used Python library for statistics, particularly on tabular data. It borrows many features from R’s dataframes.

A brief rundown of the features offered by Pandas includes:

- An efficient DataFrame object for data manipulation

- Easy reshaping and pivoting of data sets

- Merging and joining of data sets

- Label-based data slicing, indexing, and subsetting

- Allows working with time-series data

- And other crucial tools for reading and writing data into multiple formats, even between in-memory data structures (Source: towards data science)

#### Seaborn

What it does: Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and integrates closely with pandas data structures.

Seaborn helps you explore and understand your data. Its plotting functions operate on dataframes and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots. Its dataset-oriented, declarative API lets you focus on what the different elements of your plots mean, rather than on the details of how to draw them. (Source: seaborn)

#### Numpy

What it does: Provides access to N-dimensional arrays and other useful numerical tools.

The two vital benefits that NumPy has to offer are its support for powerful N-dimensional array objects and its built-in tools for performing intensive mathematical and scientific calculations. (Source: towards data science)

#### PyTorch

What it does: Provides tools and libraries for developing GPU-powered Machine Learning applications.

PyTorch is being used more commonly to research, develop, and deploy applications that leverage advanced technologies like Computer Vision and Natural Language Processing. If needed, PyTorch can also pair well with other powerful libraries like NumPy, SciPy, and Cython. (Source: towards data science)

#### Matplotlib

What it does: Helps developers create stunning visualizations. 
    
Matplotlib is one of the most popular visualization libraries for Python. Used by hundreds of companies and individuals, matplotlib lets you visualize your data in several different ways. (Source: towards data science)


#### Scikit-learn

What it does: Scikit-learn is a well-known open-source Python library for machine learning on complex data. It supports various supervised and unsupervised algorithms, such as linear regression, classification, and clustering. The library works in association with NumPy and SciPy. (Source: geeks for geeks)

##### Examples

<span style="color:red">**Use help to learn about the contents of a library module**</span>

<span style="color:red">**Attention: You'll often see the following two ways of importing things:**
- `import pandas as pd`. This is just like doing `import pandas`, but where you would write e.g. `pandas.read_csv`, you write `pd.read_csv`. You can replace `pd` with anything, so long as you are consistent. So you could write `import pandas as balloon` if you're happy to keep writing `balloon.read_csv` every time you want to use the function. It's mostly used to make things shorter, and there are some common short names that people use - if you stick to those, it'll be easier to share code. Examples: `import random as rn`, `import seaborn as sns`.
- `from random import choice`. This is a bit different. Now, instead of writing `random.choice([True,False])` in your code, you can just write `choice([True,False])`. This makes your code shorter, but it's harder for someone else to figure out what `choice` means, or what library it belongs to. You can import lots of things this way, e.g. `from random import choice, randrange`.</span>
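Both import styles can be sketched with the standard library module `random`; the variable names `coin` and `number` are assumptions:

```python
import random as rn          # alias: write rn.<name> instead of random.<name>
from random import choice    # direct import: write choice(...) with no prefix

coin = choice([True, False])  # same function as random.choice
number = rn.randrange(10)     # same function as random.randrange
print(coin, number)
```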

**Check the libraries installed**

- Use *import ... as ...* to give a library a short alias while importing it.
- Then refer to items in the library using that shortened name.
- Commonly used for libraries that are frequently used or have long names.
    - E.g., the matplotlib plotting library is often aliased as mpl.
- But can make programs harder to understand, since readers must learn your program’s aliases.

## Pandas and Numpy: when and how to use them

![numpy-and-pandas.png](https://github.com/dsmanufacturing/dsmanufacturing.github.io/blob/master/images/Screenshot%202022-01-27%20at%2002-57-32%20Difference%20between%20Pandas%20VS%20NumPy%20-%20GeeksforGeeks.png?raw=true) Source: GeeksforGeeks

### Loading csv numerical data with numpy

[Numpy cheatsheet](http://datacamp-community-prod.s3.amazonaws.com/ba1fe95a-8b70-4d2f-95b0-bc954e9071b0)

The expression *numpy.loadtxt(...)* is a function call that asks Python to run the function *loadtxt*, which belongs to the *numpy* library. This dotted notation is used everywhere in Python: the thing that appears before the dot contains the thing that appears after.

As an example, John Smith is the *John* that belongs to the *Smith* family. We could use the dot notation to write his name *smith.john*, just as *loadtxt* is a function that belongs to the *numpy* library.

*numpy.loadtxt* has two parameters: the name of the file we want to read and the delimiter that separates values on a line. These both need to be character strings (or strings for short), so we put them in quotes.
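A self-contained sketch of such a call. Because the workshop's data file is not named here, the example first writes a tiny stand-in CSV file; the file name `parts-example.csv` and its values are made up. In a notebook the bare call would display the array, and `print` is used so the sketch also works as a script:

```python
import os
import tempfile

import numpy

# Write a small stand-in CSV file (hypothetical name and values)
csv_path = os.path.join(tempfile.mkdtemp(), "parts-example.csv")
with open(csv_path, "w") as f:
    f.write("1.0,2.5,3.0\n4.0,5.5,6.0\n")

# Both arguments are strings: the file name and the delimiter
print(numpy.loadtxt(fname=csv_path, delimiter=","))
```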

Since we haven’t told it to do anything else with the function’s output, the notebook displays it. In this case, that output is the data we just loaded. By default, only a few rows and columns are shown (with ... to omit elements when displaying big arrays). Note that, to save space when displaying NumPy arrays, Python does not show us trailing zeros, so `1.0` becomes `1.`.

Our call to *numpy.loadtxt* read our file but didn’t save the data in memory. To do that, we need to assign the array to a variable. In a similar manner to how we assign a single value to a variable, we can also assign an array of values to a variable using the same syntax. Let’s re-run *numpy.loadtxt* and save the returned data:

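<span style="color:red">**Attention: functions and attributes that belong to a library or its objects, such as `pd.read_csv()` or `df.columns`, can only be used after pandas has been imported (and, for `df.columns`, after a DataFrame `df` has been created). Note that `df.columns` is an attribute, so it is written without parentheses. Built-in functions such as `print()` can be used anywhere without importing anything.** </span>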

This statement doesn’t produce any output because we’ve assigned the output to the variable data. If we want to check that the data have been loaded, we can print the variable’s value:

Now that the data are in memory, we can manipulate them. First, let’s ask what type of thing data refers to:
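A sketch of both checks. The real file is not available here, so `data` is a stand-in array of zeros with the 100 by 30 shape described in the text:

```python
import numpy

# Stand-in for the loaded data: 100 rows, 30 columns (assumption)
data = numpy.zeros((100, 30))

print(type(data))   # <class 'numpy.ndarray'>
print(data.shape)   # (100, 30)
```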

The output tells us that the *data* array variable contains 100 rows and 30 columns. When we created the variable *data* to store our parts data, we did not only create the array; we also created information about the array, called members or attributes. This extra information describes *data* in the same way an adjective describes a noun. *data.shape* is an attribute of *data* which describes the dimensions of *data*. We use the same dotted notation for the attributes of variables that we use for the functions in libraries because they have the same part-and-whole relationship.

### Index and slicing 

If we want to get a single number from the array, we must provide an [index](https://swcarpentry.github.io/python-novice-inflammation/reference.html#index) in square brackets after the variable name, just as we do in math when referring to an element of a matrix. Our data has two dimensions, so we will need to use two indices to refer to one specific value:
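A sketch with a stand-in array; the values are simply the numbers 0 to 2999 arranged in 100 rows and 30 columns, not the workshop's data:

```python
import numpy

# Stand-in array with 100 rows and 30 columns (assumption)
data = numpy.arange(3000).reshape(100, 30)

# Two indices: row first, then column
print("first value in data:", data[0, 0])
print("middle value in data:", data[50, 15])
```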

The expression data[50, 15] accesses the element at row 50, column 15. While this expression may not surprise you, data[0, 0] might. Programming languages like Fortran, MATLAB and R start counting at 1 because that’s what human beings have done for thousands of years. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because it represents an offset from the first value in the array (the second value is offset by one index from the first value). This is closer to the way that computers represent arrays (if you are interested in the historical reasons behind counting indices from zero, you can read [Nils Reichert's post](https://medium.com/@nilschristianreichert/why-zero-index-99e288a8b500)). As a result, if we have an M×N array in Python, its indices go from 0 to M-1 on the first axis and 0 to N-1 on the second. It takes a bit of getting used to, but one way to remember the rule is that the index is how many steps we have to take from the start to get the item we want.

"data" is a 3 by 3 numpy array containing row 0: ['A', 'B', 'C'], row 1: ['D', 'E', 'F'], and row 2: ['G', 'H', 'I']. Starting in the upper left hand corner, data[0, 0] = 'A', data[0, 1] = 'B', data[0, 2] = 'C', data[1, 0] = 'D', data[1, 1] = 'E', data[1, 2] = 'F', data[2, 0] = 'G', data[2, 1] = 'H', and data[2, 2] = 'I', in the bottom right hand corner.

![python-zero-index](https://swcarpentry.github.io/python-novice-inflammation/fig/python-zero-index.svg)
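The 3 by 3 array in the figure can be reproduced as a sketch:

```python
import numpy

data = numpy.array([
    ["A", "B", "C"],
    ["D", "E", "F"],
    ["G", "H", "I"],
])

# Upper left corner, centre, and bottom right corner
print(data[0, 0], data[1, 1], data[2, 2])
```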

## Pandas: explore and analyse data

### Loading tabular data using pandas 

To begin processing data, we need to load it into Python. We can do that using the library pandas.

- Load it with `import pandas as pd`. The alias `pd` is commonly used for pandas.
- Read a Comma Separated Values (CSV) data file with `pd.read_csv`.
    - Argument is the name of the file to be read.
    - Assign result to a variable to store the data that was read.
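A minimal sketch of these steps. To keep it self-contained, the CSV content is held in a string and wrapped in `io.StringIO` instead of being read from a file on disk; the column names and values are made up:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a data file
csv_text = "part_id,length_mm,weight_g\nP001,10.1,5.2\nP002,9.9,5.0\n"

# read_csv accepts a file name or a file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

With a real file, the call would take the file name as its argument instead of the `io.StringIO` object.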


### Investigating the data

### Renaming columns

### Check dataset

### Use `DataFrame.T` to transpose a dataframe.
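A sketch on a small made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame(
    {"length_mm": [10.1, 9.9], "weight_g": [5.2, 5.0]},
    index=["P001", "P002"],
)

# T swaps rows and columns; it is an attribute, so no parentheses
transposed = df.T
print(transposed)
```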

### Use `DataFrame.describe()` to get summary statistics about data.
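A sketch with a single made-up numeric column:

```python
import pandas as pd

df = pd.DataFrame({"length_mm": [10.0, 10.2, 9.8, 10.0]})

# describe returns count, mean, std, min, quartiles and max per numeric column
summary = df.describe()
print(summary)
```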

### Use `index_col` to specify that a column’s values should be used as row headings.
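A sketch using in-memory CSV text with a hypothetical `part_id` column as the row headings:

```python
import io

import pandas as pd

csv_text = "part_id,length_mm,weight_g\nP001,10.1,5.2\nP002,9.9,5.0\n"

# index_col names the column whose values become the row labels
df = pd.read_csv(io.StringIO(csv_text), index_col="part_id")
print(df)
```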

### Use `DataFrame.iloc[..., ...]` to select values by their (entry) position
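A sketch on a made-up dataframe; positions count from 0, row first and column second:

```python
import pandas as pd

df = pd.DataFrame(
    {"length_mm": [10.1, 9.9], "weight_g": [5.2, 5.0]},
    index=["P001", "P002"],
)

# Row position 0, column position 1
value = df.iloc[0, 1]
print(value)
```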

### Use `DataFrame.loc[..., ...]` to select values by their (entry) label.
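A sketch on the same kind of made-up dataframe, this time selecting by row and column labels:

```python
import pandas as pd

df = pd.DataFrame(
    {"length_mm": [10.1, 9.9], "weight_g": [5.2, 5.0]},
    index=["P001", "P002"],
)

# Row label first, column label second
value = df.loc["P002", "length_mm"]
print(value)
```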

### Use `:` on its own to mean all columns or all rows.
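A sketch selecting a whole row and a whole column from a made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame(
    {"length_mm": [10.1, 9.9], "weight_g": [5.2, 5.0]},
    index=["P001", "P002"],
)

row = df.loc["P001", :]        # all columns for one row
column = df.loc[:, "weight_g"]  # all rows for one column
print(row)
print(column)
```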

### IPython magic commands in Jupyter Notebook

You can use the **%whos** command at any time to see what variables you have created and what modules you have loaded into the computer’s memory. As this is an IPython command, it will only work if you are in an IPython terminal or the Jupyter Notebook.

    %whos

The **%pwd** command shows the current directory in which this notebook was opened; it stands for "print working directory".

    %pwd

The **%ls** command lists the contents of your current directory.

    %ls

The **%lsmagic** command lists all the magic commands that are available.

    %lsmagic