
In [None]:
import numpy as np
import pandas as pd
import datetime

## DataSeries

In [None]:
my_readings = [3.12, 3.54, 3.24, 3.67, 3.56, 3.87]

We can create a DataSeries (a pandas `Series`) from any list or NumPy array:

In [None]:
ds_readings = pd.Series(my_readings)
ds_readings

0    3.12
1    3.54
2    3.24
3    3.67
4    3.56
5    3.87
dtype: float64
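
As a complement to the list-based cell above, here is a minimal sketch of building the same series from a NumPy array, optionally with custom index labels. The letter labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same readings as above, but starting from a NumPy array.
readings_array = np.array([3.12, 3.54, 3.24, 3.67, 3.56, 3.87])
ds_from_array = pd.Series(readings_array)

# Optional: supply custom labels instead of the default 0..n-1 index.
# These labels are hypothetical, chosen only for this sketch.
labels = ["a", "b", "c", "d", "e", "f"]
ds_labelled = pd.Series(readings_array, index=labels)

print(ds_from_array.dtype)
print(ds_labelled["c"])
```

With a custom index, elements are looked up by label (`ds_labelled["c"]`) rather than by integer position.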

In [None]:
## Output Index
ds_readings.index

RangeIndex(start=0, stop=6, step=1)

In [None]:
## Output Value
ds_readings.values

array([3.12, 3.54, 3.24, 3.67, 3.56, 3.87])

In [None]:
ds_readings[0]

3.12

In [None]:
ds_readings[4]

3.56

In [None]:
ds_readings[1:4]

1    3.54
2    3.24
3    3.67
dtype: float64

**Masking**

In [None]:
## Note list input
ds_readings[[True, False, True, False, False, False]]

0    3.12
2    3.24
dtype: float64

In [None]:
ds_readings + 10

0    13.12
1    13.54
2    13.24
3    13.67
4    13.56
5    13.87
dtype: float64

In [None]:
ds_readings > 3.5

0    False
1     True
2    False
3     True
4     True
5     True
dtype: bool

### Q: Why might the above be useful?

The boolean output is useful as a mask: indexing a series with it keeps only the elements where the mask is `True`.

In [None]:
ds_readings[ds_readings > 3.5]

1    3.54
3    3.67
4    3.56
5    3.87
dtype: float64
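
A mask built from one series can also filter a different series, as long as both share the same index. The sketch below assumes a second, made-up `temperatures` series recorded at the same six positions.

```python
import pandas as pd

readings = pd.Series([3.12, 3.54, 3.24, 3.67, 3.56, 3.87])
# Hypothetical second measurement taken at the same six positions.
temperatures = pd.Series([20.1, 20.4, 19.8, 21.0, 20.7, 21.3])

# Mask derived from one series, applied to the other.
high_reading = readings > 3.5
temps_when_high = temperatures[high_reading]
print(temps_when_high)

# Masks can be combined element-wise with & (and) or | (or);
# the parentheses are required because of operator precedence.
both = (readings > 3.5) & (temperatures < 21.0)
print(readings[both])
```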

Filtering and arithmetic operators can be combined:

In [None]:
ds_readings[ds_readings > 3.5] * 2

1    7.08
3    7.34
4    7.12
5    7.74
dtype: float64

Changing values, whether all of them or only a masked subset, is just as easy:

In [None]:
ds_readings[ds_readings > 3.5] = 0

In [None]:
ds_readings

0    3.12
1    0.00
2    3.24
3    0.00
4    0.00
5    0.00
dtype: float64
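
The assignment above modifies `ds_readings` in place. If the original values need to be kept, one option is `Series.mask()`, which returns a new series instead. A minimal sketch on a fresh copy of the readings:

```python
import pandas as pd

readings = pd.Series([3.12, 3.54, 3.24, 3.67, 3.56, 3.87])

# mask() replaces values where the condition is True and returns a new Series;
# the original `readings` is left untouched.
zeroed = readings.mask(readings > 3.5, 0)

print(zeroed)
print(readings)
```

`Series.where()` is the mirror image: it keeps values where the condition is `True` and replaces the rest.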

## Exercise
   * ### Create a new DataSeries on a topic of your choosing (numeric, length = 8)
   * ### Output the 2nd, last, and last two elements
   * ### Subtract a number from all elements
   * ### Generate and apply a mask
   * ### Use the mask to set values to 0

In [None]:
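myds = pd.Series([12, 23, 34, 45, 56, 67, 78, 89])
myds[1]      # 2nd element
myds[-1:]    # last element (as a one-item Series)
myds[-2:]    # last two elements

myds - 1     # subtract a number from all elements (returns a new Series)

myds > 40    # generate a mask
myds[myds > 40] = 0    # use the mask to set values to 0
myds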

0    12
1    23
2    34
3     0
4     0
5     0
6     0
7     0
dtype: int64

### How do I modify index?

In [None]:
timings = [datetime.time(1, 3, increment) for increment in range(6)]

In [None]:
timings

[datetime.time(1, 3),
 datetime.time(1, 3, 1),
 datetime.time(1, 3, 2),
 datetime.time(1, 3, 3),
 datetime.time(1, 3, 4),
 datetime.time(1, 3, 5)]

In [None]:
ds_readings.index = timings

In [None]:
ds_readings

01:03:00    3.12
01:03:01    0.00
01:03:02    3.24
01:03:03    0.00
01:03:04    0.00
01:03:05    0.00
dtype: float64

The `.value_counts()` method finds the unique values in the series and counts how many times each one occurs:

In [None]:
ds_readings.value_counts()

0.00    4
3.12    1
3.24    1
dtype: int64
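
If proportions are more useful than raw counts, `value_counts()` accepts `normalize=True`. A minimal sketch, using the same six values on a default index:

```python
import pandas as pd

ds = pd.Series([3.12, 0.0, 3.24, 0.0, 0.0, 0.0])

counts = ds.value_counts()                 # absolute number of occurrences
shares = ds.value_counts(normalize=True)   # fraction of the series per value

print(counts)
print(shares)
```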

We can also sort the values, in ascending or descending order:

In [None]:
ds_readings.sort_values()

01:03:01    0.00
01:03:03    0.00
01:03:04    0.00
01:03:05    0.00
01:03:00    3.12
01:03:02    3.24
dtype: float64

In [None]:
ds_readings.sort_values(ascending = False)

01:03:02    3.24
01:03:00    3.12
01:03:01    0.00
01:03:03    0.00
01:03:04    0.00
01:03:05    0.00
dtype: float64
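
Sorting can also follow the index rather than the values. The sketch below uses `.sort_index()` on a small, made-up series whose time labels are deliberately out of order.

```python
import datetime
import pandas as pd

# Hypothetical three-item series with shuffled time labels.
times = [datetime.time(1, 3, s) for s in (2, 0, 1)]
ds = pd.Series([3.24, 3.12, 0.0], index=times)

by_time = ds.sort_index()                        # order by index label
by_value_desc = ds.sort_values(ascending=False)  # order by value, largest first

print(by_time)
print(by_value_desc)
```

Both methods return a new series and leave `ds` unchanged.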

### np.nan

`np.nan` represents a value that should exist but is missing. pandas provides an easy way to check for such missing values.

We first add a regular reading to the series, then an `np.nan`:

In [None]:
ds_readings[datetime.time(0,3,7)] = 3.48

In [None]:
ds_readings

01:03:00    3.12
01:03:01    0.00
01:03:02    3.24
01:03:03    0.00
01:03:04    0.00
01:03:05    0.00
00:03:07    3.48
dtype: float64

In [None]:
ds_readings[datetime.time(0,3,6)] = np.nan

In [None]:
ds_readings

01:03:00    3.12
01:03:01    0.00
01:03:02    3.24
01:03:03    0.00
01:03:04    0.00
01:03:05    0.00
00:03:07    3.48
00:03:06     NaN
dtype: float64

The `.isna()` method flags each missing value:

In [None]:
ds_readings.isna()

01:03:00    False
01:03:01    False
01:03:02    False
01:03:03    False
01:03:04    False
01:03:05    False
00:03:07    False
00:03:06     True
dtype: bool

In [None]:
ds_readings[ds_readings.isna()]

00:03:06   NaN
dtype: float64

In [None]:
result = ds_readings[ds_readings.isna()]

In [None]:
result.index

Index([00:03:06], dtype='object')

To remove items, use `.drop()` and pass the index labels of the items to remove.

In [None]:
ds_readings.drop(result.index)  # returns a new Series; add inplace=True (or reassign) to change ds_readings itself

01:03:00    3.12
01:03:01    0.00
01:03:02    3.24
01:03:03    0.00
01:03:04    0.00
01:03:05    0.00
00:03:07    3.48
dtype: float64

In [None]:
ds_readings.isna()

01:03:00    False
01:03:01    False
01:03:02    False
01:03:03    False
01:03:04    False
01:03:05    False
00:03:07    False
00:03:06     True
dtype: bool
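
The `NaN` is still present because `.drop()` returned a new series and the result was not stored. The sketch below, on a small made-up series, shows keeping the result by assignment, and `.dropna()` as a shortcut for the `.isna()` plus `.drop()` combination.

```python
import datetime
import numpy as np
import pandas as pd

# Hypothetical four-item series with two missing readings.
times = [datetime.time(0, 3, s) for s in range(4)]
ds = pd.Series([3.12, np.nan, 3.24, np.nan], index=times)

# Two-step version: find the labels of the NaN entries, then drop them.
dropped = ds.drop(ds[ds.isna()].index)

# Shortcut: dropna() removes every missing entry in one call.
cleaned = ds.dropna()

print(cleaned.equals(dropped))
print(len(ds), len(cleaned))
```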

Reduction methods such as `.any()`, `.all()`, and `.sum()` collapse the boolean result into a single value:

In [None]:
ds_readings.isna().any()

True

In [None]:
ds_readings.isna().all()

False

In [None]:
ds_readings.isna().sum()

1

In [None]:
ds_readings.unique()

array([3.12, 0.  , 3.24, 3.48,  nan])

### Mappings

In [None]:
mapping = {0: 10.0, np.nan: 0.0}
ds_readings.replace(mapping)

01:03:00     3.12
01:03:01    10.00
01:03:02     3.24
01:03:03    10.00
01:03:04    10.00
01:03:05    10.00
00:03:07     3.48
00:03:06     0.00
dtype: float64
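
A dictionary can be passed to `.map()` as well as to `.replace()`, but the two treat unmatched values differently. A minimal sketch on a small, made-up series:

```python
import numpy as np
import pandas as pd

ds = pd.Series([3.12, 0.0, 3.24, np.nan])
mapping = {0.0: 10.0}

replaced = ds.replace(mapping)   # values not in the dict pass through unchanged
mapped = ds.map(mapping)         # values not in the dict become NaN

print(replaced)
print(mapped)
```

`.replace()` is therefore the safer choice when only a few values should change.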

In [None]:
def myround(x):
    return round(x, 1)

In [None]:
myround(ds_readings)

01:03:00    3.1
01:03:01    0.0
01:03:02    3.2
01:03:03    0.0
01:03:04    0.0
01:03:05    0.0
00:03:07    3.5
00:03:06    NaN
dtype: float64

In [None]:
ds_readings.map(myround)

01:03:00    3.1
01:03:01    0.0
01:03:02    3.2
01:03:03    0.0
01:03:04    0.0
01:03:05    0.0
00:03:07    3.5
00:03:06    NaN
dtype: float64

In [None]:
ds_readings.map(lambda x: round(x,1))

01:03:00    3.1
01:03:01    0.0
01:03:02    3.2
01:03:03    0.0
01:03:04    0.0
01:03:05    0.0
00:03:07    3.5
00:03:06    NaN
dtype: float64
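
For simple numeric operations like rounding, pandas also offers vectorized methods that avoid calling a Python function per element. A sketch comparing `.round()` with the `.map()` approach, on a default-indexed copy of the values:

```python
import numpy as np
import pandas as pd

ds = pd.Series([3.12, 0.0, 3.24, 3.48, np.nan])

rounded = ds.round(1)                      # vectorized, handled by pandas/NumPy
mapped = ds.map(lambda x: round(x, 1))     # element-wise Python call

print(rounded)
print(mapped)
```

`.map()` remains the more general tool when no built-in method covers the transformation.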

## Exercise
   * ### Change the index of the series you created in the previous exercise so that it is indexed by time
   * ### Insert several np.nan values
   * ### Remove these values using `.isna()` and `.drop()`
   * ### Define a function which squares numbers given as input and apply it across the series using `.map()`
    

In [None]:
myds = pd.Series([12, 23, 34 ,45 ,56 ,67, 78, 89])
myds

0    12
1    23
2    34
3    45
4    56
5    67
6    78
7    89
dtype: int64

In [None]:
mytimings = [datetime.time( 1, 0, increment) for increment in range(8)]
mytimings

[datetime.time(1, 0),
 datetime.time(1, 0, 1),
 datetime.time(1, 0, 2),
 datetime.time(1, 0, 3),
 datetime.time(1, 0, 4),
 datetime.time(1, 0, 5),
 datetime.time(1, 0, 6),
 datetime.time(1, 0, 7)]

In [None]:
myds.index=mytimings
myds

01:00:00    12
01:00:01    23
01:00:02    34
01:00:03    45
01:00:04    56
01:00:05    67
01:00:06    78
01:00:07    89
dtype: int64

In [None]:
myds[datetime.time(0,0,2)] =  np.nan
myds[datetime.time(1,2,0)] =  np.nan
myds[datetime.time(1,0,12)] =  np.nan
myds

01:00:00    12.0
01:00:01    23.0
01:00:02    34.0
01:00:03    45.0
01:00:04    56.0
01:00:05    67.0
01:00:06    78.0
01:00:07    89.0
00:00:02     NaN
01:02:00     NaN
01:00:12     NaN
dtype: float64

In [None]:
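myds.isna()                  # boolean mask of missing values
myds[myds.isna()]            # the missing entries themselves
nans = myds[myds.isna()]
myds.drop(nans.index)        # returns a new Series; myds itself is unchanged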

01:00:00    12.0
01:00:01    23.0
01:00:02    34.0
01:00:03    45.0
01:00:04    56.0
01:00:05    67.0
01:00:06    78.0
01:00:07    89.0
dtype: float64

In [None]:
def mysqr(x):
    return x**2

myds.map(mysqr)

01:00:00     144.0
01:00:01     529.0
01:00:02    1156.0
01:00:03    2025.0
01:00:04    3136.0
01:00:05    4489.0
01:00:06    6084.0
01:00:07    7921.0
00:00:02       NaN
01:02:00       NaN
01:00:12       NaN
dtype: float64