# UBC
## Programming in Python for DS

### Week 7
Instructor: Socorro Dominguez-Vidana

- [ ] Describe what Python libraries are, as well as explain when and why they are useful.
- [ ] Identify where code can be improved concerning variable names, magic numbers, comments and whitespace.
- [ ] Write code that is human readable and follows the black style guide.
- [ ] Import files from other directories.
- [ ] Use pytest to check a function's tests.
- [ ] When running pytest, explain how pytest finds the associated test functions.
- [ ] Explain how the Python debugger can help rectify your code.

### Python libraries 
- Collections of pre-written code or modules that provide various functionalities to perform specific tasks.
- These libraries contain reusable classes (e.g. `DataFrame`), functions (e.g. `groupby()`, `sort_values()`) and constants that can be imported and used in your Python programs.
- Libraries save time and effort by offering ready-made solutions to common problems. 

#### Importing a Python library

Before a library can be imported, it has to be installed with `conda` or `pip`:
```
conda install pandas
```

```
pip install pandas
```


```python
import pandas as pd
```
- `import` is a keyword.
- `pandas` is the library we have used all this time.
- `as` declares an alias.
- `pd` is the alias, so that we can write `pd.read_csv` instead of `pandas.read_csv`.

In [None]:
import pandas as pd
pd.read_csv('data/file.csv')  # read_csv needs the path of the file to read

```python
import pandas as pd
```

```python
pd.read_csv('data/file.csv')
```

```python
from pandas import read_csv
```

In [1]:
from pandas import read_csv
globals()

{'__name__': '__main__',
 '__doc__': 'Automatically created module for IPython interactive environment',
 '__package__': None,
 '__loader__': None,
 '__spec__': None,
 '__builtin__': <module 'builtins' (built-in)>,
 '__builtins__': <module 'builtins' (built-in)>,
 '_ih': ['', 'from pandas import read_csv\nglobals()'],
 '_oh': {},
 '_dh': [PosixPath('/Users/sedv8808/HT-Data/Instructor/UBC/Instructor/UBC_EL_PPDS')],
 'In': ['', 'from pandas import read_csv\nglobals()'],
 'Out': {},
 'get_ipython': <bound method InteractiveShell.get_ipython of <ipykernel.zmqshell.ZMQInteractiveShell object at 0x10526e9d0>>,
 'exit': <IPython.core.autocall.ZMQExitAutocall at 0x10527f110>,
 'quit': <IPython.core.autocall.ZMQExitAutocall at 0x10527f110>,
 'open': <function io.open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)>,
 '_': '',
 '__': '',
 '___': '',
 '__session__': '/Users/sedv8808/HT-Data/Instructor/UBC/Instructor/UBC_EL_PPDS/Module7class.ipynb
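To make the difference between the two import styles concrete, here is a minimal sketch. It assumes pandas is installed, and the CSV text is made up for illustration so that no file on disk is needed.

```python
import io

import pandas as pd
from pandas import read_csv

# Both names point at the same function object; only the way we reach it differs.
same_function = read_csv is pd.read_csv
print(same_function)

# Hypothetical CSV content held in memory instead of a file on disk.
csv_text = io.StringIO("name,hourly_rate\nJosh,25.0\nLaura,27.0\n")
rates = read_csv(csv_text)
print(rates)
```

With `import pandas as pd` only the name `pd` is added to the notebook's namespace, while `from pandas import read_csv` adds the name `read_csv` directly.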

### Code Improvements

- Code you write will be used in the future. Make sure to aim for best practices:
    - **Variable Names:** Use descriptive and meaningful names that indicate the purpose of the variable. Avoid single-letter names or ambiguous terms.
    - **Magic Numbers:** Replace hardcoded numbers with named constants or variables to enhance readability and maintainability.
    - **Comments:** Add comments to clarify complex code sections, explain the intent behind the logic, or provide context where the code might be unclear.
    - **Whitespace:** Ensure consistent indentation, use whitespace to enhance code readability, and follow the [PEP 8](https://peps.python.org/pep-0008/) style guide.

```python
def func(x,y):
    '''Docstrings'''
    z = 3 + y
    # Filtering by y variable
    return z
```
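As a rough illustration of the four points above, here is one way a function like this could be cleaned up. The names `BONUS_PER_HOUR` and `add_bonus` are hypothetical, and the "meaning" given to the number 3 is an assumption made only for this sketch.

```python
# Before: vague names, an unused parameter and a magic number.
def func(x, y):
    z = 3 + y
    return z


# After: the constant has a name and the function says what it does.
# BONUS_PER_HOUR and add_bonus are hypothetical names chosen for this sketch.
BONUS_PER_HOUR = 3


def add_bonus(hourly_rate):
    """Return the hourly rate increased by the fixed bonus."""
    return hourly_rate + BONUS_PER_HOUR


print(add_bonus(25.0))
```

Both versions compute the same value, but only the second one tells the reader what the number means and what the function is for.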

- You cannot always get it perfect by hand. To help you, we have `flake8` and `black`.

```
flake8 <file>.py
black <file>.py
```

- `flake8` reports style errors but does not change the file.
- `black` rewrites the file in place to fix its formatting.

Let's see how to use them in the terminal and in the notebook:

In [2]:
!flake8 sampling.py

sampling.py:4:43: E251 unexpected spaces around keyword / parameter equals
sampling.py:4:45: E251 unexpected spaces around keyword / parameter equals
sampling.py:8:1: W293 blank line contains whitespace
sampling.py:18:1: W293 blank line contains whitespace
sampling.py:23:1: W293 blank line contains whitespace
sampling.py:31:1: W293 blank line contains whitespace
sampling.py:33:1: W293 blank line contains whitespace
sampling.py:35:1: W293 blank line contains whitespace
sampling.py:36:35: W291 trailing whitespace
sampling.py:37:25: E222 multiple spaces after operator

In [None]:
# We will run this later.
#!black sampling.py

### Import functions from other files

So far, we have defined functions in the notebook and used them there: 

```python
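import pandas as pd


def sample_dataframe(data, grouping_col, N=1):
    # Split the dataframe into one group per unique value of grouping_col
    df_grouped = data.groupby(grouping_col)

    sampled_df = None

    for group, rows in df_grouped:
        # Draw N random rows from the current group
        group_sampling = rows.sample(N)
        # pd.concat drops None, so the first pass simply starts the result
        sampled_df = pd.concat([sampled_df, group_sampling])

    return sampled_df


sample_dataframe(data, 'column', N=3)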
```

But that breaks the flow of the notebook.

What if we could write the function elsewhere but use it in our notebook to not break the flow?

#### How to call your own functions

- You have been importing whole sets of functions when you do:
```python
import pandas as pd
```

- But if you only wanted to import **one** function from pandas, for example, `pd.read_csv` you could do:

```python
from pandas import read_csv
```

#### What about personal functions / scripts?

- I have created a file called `sampling.py` on my local computer.
- It contains a function called `sample_dataframe`.

Try each of the following:

1) 
```python
import sampling
sampling.sample_dataframe(data, 'neighbourhood')
```
- `import` - keyword that loads a module (a `.py` file) so that its contents can be used as `sampling.<name>`

2) 
```python
from sampling import sample_dataframe
sample_dataframe(data, 'neighbourhood')
```
- `from` - keyword that picks only the names we want out of a particular script

3) 
```python
import sampling as sp
sp.sample_dataframe(data, 'neighbourhood')
sp.sample_dataframe2(data, 'neighbourhood')
```
- `as` - keyword for creating an alias
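The three styles can be tried without `sampling.py` at hand. The sketch below writes a tiny stand-in module to a temporary folder and imports it; the module name `my_helpers` and its `double` function are hypothetical. It also shows one way to import from another directory: adding that directory to `sys.path`, the list of folders Python searches when it sees `import`.

```python
import sys
import tempfile
from pathlib import Path

# Write a tiny stand-in module to a temporary folder (hypothetical content).
module_dir = Path(tempfile.mkdtemp())
(module_dir / "my_helpers.py").write_text(
    "def double(number):\n"
    "    '''Return the number multiplied by two.'''\n"
    "    return number * 2\n"
)

# Python searches the folders listed in sys.path when it sees `import`.
sys.path.append(str(module_dir))

import my_helpers                # 1) the whole module
from my_helpers import double    # 2) one hand-picked function
import my_helpers as mh          # 3) the module under an alias

print(my_helpers.double(4), double(4), mh.double(4))
```

A file that sits in the same folder as the notebook, like `sampling.py` here, can normally be imported without touching `sys.path`.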

##### Example

In [3]:
# Toy Dataset 
import pandas as pd
a_dict = {
    'employee_id': [1873, 4913, 4801, 4540, 3581,
                    4534, 1934, 4944, 1983, 1266],
    'name': ['Josh', 'Laura', 'Hayley', 'Mike', 'Tiffany',
             'Anurag', 'Rocio', 'Eric', 'Monique', 'Emma'],
    'neighbourhood': ['Sunset', 'West end', 'Kitsilano', 'Sunset',
                      'Arbutus-ridge', 'Arbutus-ridge', 'Kitsilano',
                      'West end', 'Kitsilano', 'Arbutus-ridge'],
    'type': ['full-time', 'part-time', 'part-time', 'full-time', 'part-time',
             'full-time', 'full-time', 'part-time', 'part-time', 'full-time'],
    'hourly_rate': [25.0, 27.0, 30.0, 25.5, 32.0,
                    26.5, 27.0, 28.0, 25.5, 23.0],
}

data = pd.DataFrame.from_dict(a_dict)
data.head()

|   | employee_id | name | neighbourhood | type | hourly_rate |
|---|---|---|---|---|---|
| 0 | 1873 | Josh | Sunset | full-time | 25.0 |
| 1 | 4913 | Laura | West end | part-time | 27.0 |
| 2 | 4801 | Hayley | Kitsilano | part-time | 30.0 |
| 3 | 4540 | Mike | Sunset | full-time | 25.5 |
| 4 | 3581 | Tiffany | Arbutus-ridge | part-time | 32.0 |


In [5]:
import sampling as sp
#globals()
sp.sample_dataframe(data, 'neighbourhood')

|   | employee_id | name | neighbourhood | type | hourly_rate |
|---|---|---|---|---|---|
| 5 | 4534 | Anurag | Arbutus-ridge | full-time | 26.5 |
| 2 | 4801 | Hayley | Kitsilano | part-time | 30.0 |
| 3 | 4540 | Mike | Sunset | full-time | 25.5 |
| 1 | 4913 | Laura | West end | part-time | 27.0 |


Is `sampling.py` formatted properly?

Verify with `flake8`:

In [6]:
!flake8 sampling.py

sampling.py:4:43: E251 unexpected spaces around keyword / parameter equals
sampling.py:4:45: E251 unexpected spaces around keyword / parameter equals
sampling.py:8:1: W293 blank line contains whitespace
sampling.py:18:1: W293 blank line contains whitespace
sampling.py:23:1: W293 blank line contains whitespace
sampling.py:31:1: W293 blank line contains whitespace
sampling.py:33:1: W293 blank line contains whitespace
sampling.py:35:1: W293 blank line contains whitespace
sampling.py:36:35: W291 trailing whitespace
sampling.py:37:25: E222 multiple spaces after operator

In [7]:
!black sampling.py

reformatted sampling.py

All done! ✨ 🍰 ✨
1 file reformatted.


In [9]:
!flake8 sampling.py

### About Docstrings
A programming function is written as:

```
def <function_name>([<parameters>]):
    '''
    Docstrings
    '''
    <statement(s)>
    <return>
```

How do we know how to use a function? We read its docstring, which is why writing one is extremely important.
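As a minimal sketch of the template above, here is a small function with a docstring in the same Parameters / Returns / Examples layout. The function `hourly_to_weekly` is hypothetical and only serves to show where the docstring goes and how to read it back.

```python
def hourly_to_weekly(hourly_rate, hours_per_week=40):
    """
    Convert an hourly rate into weekly pay.

    Parameters
    ----------
    hourly_rate : float
        Pay for one hour of work
    hours_per_week : int, optional
        Hours worked in a week (the default value is 40)

    Returns
    -------
    float
        The pay for one week

    Examples
    --------
    >>> hourly_to_weekly(25.0)
    1000.0
    """
    return hourly_rate * hours_per_week


# The docstring is stored on the function itself.
print(hourly_to_weekly.__doc__)

# help() prints the signature together with the docstring.
help(hourly_to_weekly)
```

In a notebook, `?hourly_to_weekly` displays the same information.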

In [10]:
### Accessing the docstrings or documentation

?sp.sample_dataframe

Signature: sp.sample_dataframe(data, grouping_col, N=1)
Docstring:
Given a dataframe, return a smaller sample of the dataframe
sampling N rows from each specified group

Parameters
----------
data : pandas.core.frame.DataFrame
    The dataframe to sample from
grouping_col : str
    The column to filter our condition on
N : int, optional
    The number of rows to sample from each group (The default value is 1
    which implies a single observation)
    
Returns
-------
pandas.core.frame.DataFrame
    The new sampled dataframe
    
Examples
--------
>>> sample_dataframe(pokemon, 'legendary')
    name     deck_no  attack  defense  type    gen  legendary
411 Burmy     412     29        45      bug     4      0
640 Tornadus  641     100       80      flying  5      1
File:      ~/HT-Data/Instructor/UBC/Instructor/UBC

### Testing files with `pytest`

- `pytest` is a testing framework for Python.
    - To test a function, create test functions with names starting with `test_` in a file, and then run `pytest` in the terminal to execute the tests.

**Notes:**
- To keep a project organized, test files usually live in a separate folder called `tests`.
- For now, we will keep them in the same directory.

- `pytest` identifies test functions by looking for functions whose names start with `test_`, inside files named `test_*.py` or `*_test.py`. When you run `pytest` without arguments, it automatically discovers and executes all of these test functions.
    - We do not currently want that because the assignments have a different kind of test (the autograder), so you **must** specify which file you want to test:

```
pytest test_sampling.py
```
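For a feel of what such a test file can contain, here is a minimal sketch. It is not the course's `test_sampling.py`: the simplified `sample_dataframe` defined inline, the helper `make_toy_data` and the two test names are assumptions made for illustration. A real test file would import the function from `sampling.py` instead of redefining it.

```python
import pandas as pd


def sample_dataframe(data, grouping_col, N=1):
    """Simplified stand-in: N random rows from each group."""
    pieces = [rows.sample(N) for _, rows in data.groupby(grouping_col)]
    return pd.concat(pieces)


def make_toy_data():
    """Small hypothetical dataframe shared by the tests below."""
    return pd.DataFrame({
        'name': ['Josh', 'Laura', 'Hayley', 'Mike'],
        'type': ['full-time', 'part-time', 'part-time', 'full-time'],
    })


# pytest collects these functions because their names start with `test_`.
def test_one_row_per_group():
    sampled = sample_dataframe(make_toy_data(), 'type')
    assert len(sampled) == 2
    assert set(sampled['type']) == {'full-time', 'part-time'}


def test_n_rows_per_group():
    sampled = sample_dataframe(make_toy_data(), 'type', N=2)
    assert len(sampled) == 4
```

Each test passes if none of its `assert` statements fails. Because sampling is random, the tests check properties of the result (row counts, groups present) rather than specific rows.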

In [11]:
!pytest test_sampling.py

platform darwin -- Python 3.11.4, pytest-7.4.1, pluggy-1.3.0
rootdir: /Users/sedv8808/HT-Data/Instructor/UBC/Instructor/UBC_EL_PPDS
plugins: cov-4.1.0
collected 6 items

test_sampling.py .....F                                                  [100%]

________________________________ test_sd_cherry ________________________________

    def test_sd_cherry():
        raw = {'id': [1873, 4913, 4801, 4540, 3581,
                       4534, 1934, 4944, 1983, 1266],
               'name': ['English

In [None]:
pwd

### Explaining for loops in groupby objects

In [12]:
a_dict = {'a': 4, 'b': 7, 'c': 5}

In [13]:
a_dict.items()

dict_items([('a', 4), ('b', 7), ('c', 5)])

In [14]:
for k in a_dict:
    print(k)

a
b
c


In [15]:
for k, v in a_dict.items():
    print(k, v)

a 4
b 7
c 5


```python
for key, value in my_dict.items():
    print(key)           # the key
    print(my_dict[key])  # the value, looked up through the key
    print(value)         # the same value, unpacked directly by .items()
```

In [16]:
import pandas as pd
a_dict = {
    'employee_id': [1873, 4913, 4801, 4540, 3581,
                    4534, 1934, 4944, 1983, 1266],
    'name': ['Josh', 'Laura', 'Hayley', 'Mike', 'Tiffany',
             'Anurag', 'Rocio', 'Eric', 'Monique', 'Emma'],
    'neighbourhood': ['Sunset', 'West end', 'Kitsilano', 'Sunset',
                      'Arbutus-ridge', 'Arbutus-ridge', 'Kitsilano',
                      'West end', 'Kitsilano', 'Arbutus-ridge'],
    'type': ['full-time', 'part-time', 'part-time', 'full-time', 'part-time',
             'full-time', 'full-time', 'part-time', 'part-time', 'full-time'],
    'hourly_rate': [25.0, 27.0, 30.0, 25.5, 32.0,
                    26.5, 27.0, 28.0, 25.5, 23.0],
}

data = pd.DataFrame.from_dict(a_dict)
data

|   | employee_id | name | neighbourhood | type | hourly_rate |
|---|---|---|---|---|---|
| 0 | 1873 | Josh | Sunset | full-time | 25.0 |
| 1 | 4913 | Laura | West end | part-time | 27.0 |
| 2 | 4801 | Hayley | Kitsilano | part-time | 30.0 |
| 3 | 4540 | Mike | Sunset | full-time | 25.5 |
| 4 | 3581 | Tiffany | Arbutus-ridge | part-time | 32.0 |
| 5 | 4534 | Anurag | Arbutus-ridge | full-time | 26.5 |
| 6 | 1934 | Rocio | Kitsilano | full-time | 27.0 |
| 7 | 4944 | Eric | West end | part-time | 28.0 |
| 8 | 1983 | Monique | Kitsilano | part-time | 25.5 |
| 9 | 1266 | Emma | Arbutus-ridge | full-time | 23.0 |


In [17]:
data.groupby('type')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1052823d0>

In [23]:
data.groupby('type').get_group('part-time')

|   | employee_id | name | neighbourhood | type | hourly_rate |
|---|---|---|---|---|---|
| 1 | 4913 | Laura | West end | part-time | 27.0 |
| 2 | 4801 | Hayley | Kitsilano | part-time | 30.0 |
| 4 | 3581 | Tiffany | Arbutus-ridge | part-time | 32.0 |
| 7 | 4944 | Eric | West end | part-time | 28.0 |
| 8 | 1983 | Monique | Kitsilano | part-time | 25.5 |


In [19]:
grouped_df = data.groupby('type')

In [21]:
grouped_df

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1085aa3d0>

What, then, do we get on each pass when we write a for loop over a grouped dataframe?

```python
for ___ in grouped_df:
```

In [22]:
grouped_df.groups

{'full-time': [0, 3, 5, 6, 9], 'part-time': [1, 2, 4, 7, 8]}

In [24]:
for group, rows in grouped_df:
    print("Group:", group)
    print("Rows:", rows)

Group: full-time
Rows:    employee_id    name  neighbourhood       type  hourly_rate
0         1873    Josh         Sunset  full-time         25.0
3         4540    Mike         Sunset  full-time         25.5
5         4534  Anurag  Arbutus-ridge  full-time         26.5
6         1934   Rocio      Kitsilano  full-time         27.0
9         1266    Emma  Arbutus-ridge  full-time         23.0
Group: part-time
Rows:    employee_id     name  neighbourhood       type  hourly_rate
1         4913    Laura       West end  part-time         27.0
2         4801   Hayley      Kitsilano  part-time         30.0
4         3581  Tiffany  Arbutus-ridge  part-time         32.0
7         4944     Eric       West end  part-time         28.0
8         1983  Monique      Kitsilano  part-time         25.5
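The loop above only prints each group. As a small sketch of doing something with each group, the example below computes the mean hourly rate per employment type, the same way `.items()` hands us a key and a value for a dictionary. It uses only the first four rows of the toy dataset, and the name `mean_rate_by_type` is chosen here for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    'name': ['Josh', 'Laura', 'Hayley', 'Mike'],
    'type': ['full-time', 'part-time', 'part-time', 'full-time'],
    'hourly_rate': [25.0, 27.0, 30.0, 25.5],
})

mean_rate_by_type = {}
for group, rows in data.groupby('type'):
    # `group` is the group label; `rows` is a dataframe with that group's rows
    mean_rate_by_type[group] = rows['hourly_rate'].mean()

print(mean_rate_by_type)
```

In practice `data.groupby('type')['hourly_rate'].mean()` gives the same numbers in one line; the explicit loop is useful when each group needs custom handling, as in `sample_dataframe`.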
