## Part 1: Importing, examining, and updating values in a dataset

In [None]:
import pandas as pd

**Use the pandas `read_csv()` function (called as `pd.read_csv()`) to import the file `libraries.csv` from the `data` folder, assigning the result to `df`:**

In [None]:
df = pd.read_csv('data/libraries.csv')


**Check that the file has been imported as expected using the `.head()` method to look at the first `3` rows of `df`:**

In [None]:
df.head(3)


**Use the `.shape` attribute to find out the number of rows and columns in `df`, assigning the number of rows to `rows` and number of columns to `cols`:**

In [None]:
shape = df.shape
rows = shape[0]
cols = shape[1]


In [None]:
print(rows)
print(cols)
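
Because `.shape` is an ordinary tuple of the form `(rows, columns)`, the two values can also be assigned in one step with tuple unpacking. The following is a minimal sketch on a small made-up DataFrame; `toy`, `toy_rows`, and `toy_cols` are illustrative names and the values are not taken from `libraries.csv`:

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate .shape
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# .shape is a tuple (number of rows, number of columns),
# so it can be unpacked straight into two variables
toy_rows, toy_cols = toy.shape

print(toy_rows, toy_cols)
```

Indexing (`shape[0]`, `shape[1]`) and unpacking give the same result; unpacking is simply shorter when both values are needed.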

**Use the `.describe()` method on `df` to look at some statistics for the numerical data columns:**

In [None]:
df.describe()


Notice that the value for `min` in both columns is negative, which suggests an error in the data.

**You're told that there is an error in the row with index `2233`; use `.loc[]` to have a look at this row:**

In [None]:
df.loc[2233]


**Using `.loc[]`, update the values for both `Weekly hours open` and `Weekly hours staffed` on this row to `57`:**

In [None]:
df.loc[2233, ['Weekly hours open', 'Weekly hours staffed']] = 57


- Run your code again for the previous question to check that both values on the row at index `2233` have been updated.

The `min` value for these columns (which can be seen again using the `.describe()` method on the DataFrame, or the `.min()` method on a given column) should now be `0`:

In [None]:
df.loc[:, ['Weekly hours open','Weekly hours staffed']].min()

In [None]:
df.loc[:, ['Weekly hours open','Weekly hours staffed']].describe()
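
The label-based update used above can be tried in isolation. This sketch assumes a small made-up DataFrame (`toy`, with hypothetical column names and values) that contains one bad row with negative values, and shows that `.loc[row_label, [columns]] = value` changes only that row:

```python
import pandas as pd

# Hypothetical toy data: the row labelled 11 holds erroneous negative values
toy = pd.DataFrame(
    {'hours open': [40, -1, 0], 'hours staffed': [35, -1, 0]},
    index=[10, 11, 12],
)

# .loc selects by label, so 11 here is an index label, not a position;
# the single value 57 is assigned to both listed columns on that row
toy.loc[11, ['hours open', 'hours staffed']] = 57

# Column minimums after the fix
col_mins = toy.min()
print(col_mins)
```

After the update, the smallest value left in each toy column is the `0` on the last row, which mirrors the check performed on `df`.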

## Part 2: Filtering, sorting, and modifying DataFrames

**Use the `.set_index()` method to use the values from the `Library name` column as the index of `df`, using the parameter `inplace=True`:**

In [None]:
df.set_index('Library name', inplace=True)


**Use the `.drop()` method to remove the `Notes` column:**
- remember to use `axis=1`
- use `inplace=True`

In [None]:
df.drop('Notes', axis=1, inplace=True)


**Create a function called `open_2016` which takes two parameters called `data` and `name`, where `data` is a DataFrame (such as `df`) and `name` is a string (such as the values in `Library name`, which we have set as the index), and returns the value in the `In use 2016` column for the row whose index is equal to `name`:**

- use the `.loc[]` indexer

In [None]:
def open_2016(data, name):
    
    val = data.loc[name, 'In use 2016']
    
    return val 



In [None]:
open_2016(df, 'Barking')

**Use the `.value_counts()` method on `df['In use 2010']` to see what different values there are in that column:**

In [None]:
df['In use 2010'].value_counts()

**Create a function called `is_open()`, which takes a single value and returns a Boolean value as follows:**
- `True` if the value equals `'yes'` or `'Yes'`
- `False` for any other value

In [None]:
# Notice below that we do not need to use the if... else... statement construction:
# the membership test already evaluates to a Boolean, so we can return it directly.


def is_open(entry):
    
    return entry in ['yes', 'Yes']




In [None]:
is_open('No'), is_open('yes')

**Use the `.apply()` method with your `is_open()` function and each of the columns `In use 2010` and `In use 2016`, to create new columns called `open_2010` and `open_2016` respectively, each containing Boolean values returned by the function:**

In [None]:
df['open_2010'] = df['In use 2010'].apply(is_open)
df['open_2016'] = df['In use 2016'].apply(is_open)


In [None]:
df[['open_2010', 'open_2016']].dtypes

In [None]:
df['open_2010'].value_counts(), df['open_2016'].value_counts()
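
As an aside, the same Boolean column can usually be produced without a custom function by using the `.isin()` Series method. The sketch below compares the two approaches on a made-up Series; `toy_col`, `via_apply`, and `via_isin` are illustrative names, and the values are assumptions rather than the real contents of `In use 2010`:

```python
import pandas as pd


def is_open(entry):
    # Same logic as the exercise: True only for 'yes' or 'Yes'
    return entry in ['yes', 'Yes']


# Hypothetical toy values standing in for an 'In use ...' column
toy_col = pd.Series(['yes', 'No', 'Yes', 'no'])

# Element-by-element, using the custom function
via_apply = toy_col.apply(is_open)

# Vectorised membership test against the same list of accepted values
via_isin = toy_col.isin(['yes', 'Yes'])

print(via_apply.tolist())
print(via_isin.tolist())
```

Both approaches give the same Booleans here; `.apply()` is the more general tool, while `.isin()` is a convenient shortcut when the test is simple membership in a list.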

**Create a new column in `df` called `open_both`, which contains a Boolean `True` for entries where both `open_2010` and `open_2016` are `True`:**

In [None]:
df['open_both'] = df['open_2010'] & df['open_2016']


In [None]:
df['open_both'].value_counts()
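
Combining two Boolean columns relies on the element-wise `&` operator rather than Python's `and` keyword, which raises a `ValueError` when given whole Series. A minimal sketch with two made-up Boolean Series (`a`, `b`, and `both` are illustrative names):

```python
import pandas as pd

# Hypothetical Boolean Series standing in for open_2010 and open_2016
a = pd.Series([True, True, False])
b = pd.Series([True, False, False])

# & compares the Series position by position:
# the result is True only where both inputs are True
both = a & b

print(both.tolist())
```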

**Assign to `df_open` a DataFrame containing only entries where `open_both == True`:**
- use the `.copy()` method so that `df_open` can be subsequently modified without affecting `df`

In [None]:
df_open = df[df['open_both']==True].copy()


In [None]:
len(df_open) == df['open_both'].value_counts()[True]
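
The effect of `.copy()` can be checked on a small example. This sketch assumes a made-up DataFrame (`toy`, with hypothetical values) and also shows that a Boolean column can be used as a filter directly, without comparing it to `True`:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'open_both': [True, False, True], 'hours': [40, 0, 35]})

# A Boolean Series works as a mask on its own; '== True' is optional
toy_open = toy[toy['open_both']].copy()

# Modifying the copy does not change the original DataFrame
toy_open['hours'] = 0

print(toy['hours'].tolist())
print(toy_open['hours'].tolist())
```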

**Calculate the `.mean()` of the values in the `Weekly hours open` column:**

In [None]:
df_open['Weekly hours open'].mean()


**Calculate the percentage of entries in `df_open` where `['Weekly hours open'] == 0`:**

In [None]:
len(df_open[df_open['Weekly hours open'] == 0]) / len(df_open) * 100

Converting the zero values to `NaN` values will result in those values being excluded from subsequent calculations of, for example, `.mean()`.

These `NaN` values ('Not a Number') represent 'missing data' (of all types, not just numeric) in pandas.

However (like many components of the data structures used by pandas) they originate from the `numpy` package, and thus to create them we need to `import numpy`, by convention `as np`:

In [None]:
import numpy as np
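
To see why replacing zeros with `NaN` changes the mean, it can help to compare two tiny Series. This is only a sketch with made-up numbers (`with_zeros` and `with_nans` are illustrative names), relying on the fact that `.mean()` skips `NaN` values by default:

```python
import numpy as np
import pandas as pd

# Hypothetical values: the same data with a zero, then with NaN in its place
with_zeros = pd.Series([0, 40, 50])
with_nans = pd.Series([np.nan, 40, 50])

# The zero counts as a real value: (0 + 40 + 50) / 3
mean_with_zeros = with_zeros.mean()

# The NaN is skipped: (40 + 50) / 2
mean_with_nans = with_nans.mean()

print(mean_with_zeros, mean_with_nans)
```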

**`.apply()` a `lambda` function to modify `Weekly hours open` so that `if` the value `== 0` it is updated to `np.nan`, `else` it is unchanged:**

In [None]:
df_open['Weekly hours open'] = df_open['Weekly hours open'].apply(lambda x: np.nan if x == 0 else x)


In [None]:
df_open['Weekly hours open'].value_counts()
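
For a simple value-for-value substitution like this one, the `.replace()` Series method is a possible alternative to `.apply()` with a `lambda`. The sketch below uses a made-up Series (`toy_hours` is an illustrative name) to compare the two, and counts the resulting missing values with `.isna()`:

```python
import numpy as np
import pandas as pd

# Hypothetical opening hours, including two zeros
toy_hours = pd.Series([0, 40, 0, 35.5])

# The approach from the exercise
via_apply = toy_hours.apply(lambda x: np.nan if x == 0 else x)

# Alternative: swap every 0 for NaN in one call
via_replace = toy_hours.replace(0, np.nan)

# Number of missing values created by the replacement
n_missing = via_replace.isna().sum()

print(via_apply.equals(via_replace), n_missing)
```

Note that `.value_counts()` leaves out `NaN` by default, so the converted zeros will not appear in its output unless `dropna=False` is passed.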

Run your earlier code to calculate the `.mean()` of `Weekly hours open` to see if the value is now different.

In [None]:
hours_in_week = 24 * 7

**Assign to `df_errors` a DataFrame containing entries from `df_open` where the value for `Weekly hours open` exceeds `hours_in_week`:**
- use the `.copy()` method when creating the new DataFrame

In [None]:
df_errors = df_open[df_open['Weekly hours open'] > hours_in_week].copy()


In [None]:
df_errors

**Use the `.sort_values()` method on `df_errors` to look at the entries with the highest values for `Weekly hours open` at the top, i.e. in descending order:**

In [None]:
df_errors.sort_values(by='Weekly hours open', ascending=False)
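
Unless `inplace=True` is used, `.sort_values()` returns a new, sorted DataFrame and leaves the original order untouched. A minimal sketch on made-up data (`toy` and `toy_sorted` are illustrative names, and the values are assumptions):

```python
import pandas as pd

# Hypothetical toy data with implausibly large weekly hours
toy = pd.DataFrame({'hours': [200, 500, 170]}, index=['A', 'B', 'C'])

# Sort a new DataFrame with the largest values first
toy_sorted = toy.sort_values(by='hours', ascending=False)

print(toy_sorted.index.tolist())  # order in the sorted result
print(toy.index.tolist())         # the original is unchanged
```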


**Use the `.unique()` Series method and the `.tolist()` method to create a list of the unique values for `Library service` in `df_errors`:**

In [None]:
df_errors['Library service'].unique().tolist()
