In [None]:
# import libraries we'll use 

import pandas as pd 

In [None]:
# import dataframe we created and saved as csv 

# this is the same as the one we created in Pandas part 2

transactions = pd.read_csv("../data/mer_utfyllende_transaksjon.csv")

transactions.head()

## Dates in Python 

Dates usually arrive as a column in a Pandas dataframe. If you create the dataframe with the standard `pd.read_csv()` function, Pandas reads the date column as strings, so we need to convert it to a datetime type. Dates also come in different formats, so we specify the format explicitly.

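You can also pass the `parse_dates` parameter (for example, a list of column names) to `read_csv` when importing the data, but it will not always work, depending on the date format. Note that `parse_dates=True` only attempts to parse the index.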
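As a rough sketch of the `parse_dates` option, the example below reads a small made-up CSV twice: once without the parameter and once with the date column listed in it. The sample rows are hypothetical; only the column name `logTimestamp` is borrowed from the transactions data.

```python
import io

import pandas as pd

# Hypothetical two-row sample; the column names mirror the transactions data
csv_text = "id,logTimestamp\n1,2021-03-01 14:05:00\n2,2021-03-02 09:30:00\n"

# Without parse_dates the column is read as strings, not datetimes
raw = pd.read_csv(io.StringIO(csv_text))

# Listing the column in parse_dates converts it while reading
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["logTimestamp"])

print(raw["logTimestamp"].dtype)
print(parsed["logTimestamp"].dtype)
```

If the second dtype is not a datetime type for your own file, the format was probably not recognized, and an explicit `pd.to_datetime(..., format=...)` call is the safer route.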


In [None]:
# the transactions dataframe has two dates: transaction date (logTimestamp) and date of birth (DOB)
# here we convert the transaction date
date_format = "%Y-%m-%d %H:%M:%S"

transactions['logTimestamp'] = pd.to_datetime(transactions['logTimestamp'], format=date_format)

transactions.head()

### Date formats

See the Python documentation for the date format codes [here](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes)
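To illustrate how the format codes fit together, here is a small sketch that parses day/month/year strings and then writes them back out in ISO order. The date strings are made up for the example.

```python
import pandas as pd

# Hypothetical date strings in day/month/year order
raw_dates = pd.Series(["31/12/2021 23:59", "01/02/2022 08:15"])

# %d = day, %m = month, %Y = four-digit year, %H = hour (24h), %M = minute
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y %H:%M")

# strftime goes the other way: datetime back to a formatted string
as_text = parsed.dt.strftime("%Y-%m-%d")

print(as_text.tolist())
```

Spelling out the format matters here: without it, a string such as `01/02/2022` could be read as either 1 February or 2 January.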

In [None]:
# check your df 

transactions.info()

In [None]:
# now we can access specific elements of the date within each column

transactions['tran_year'] = transactions['logTimestamp'].dt.year
transactions['tran_month'] = transactions['logTimestamp'].dt.month
transactions['tran_week'] = transactions['logTimestamp'].dt.isocalendar().week
transactions['tran_day'] = transactions['logTimestamp'].dt.day
transactions['tran_day_name'] = transactions['logTimestamp'].dt.day_name()
transactions['tran_month_name'] = transactions['logTimestamp'].dt.month_name()

transactions.head()

In [None]:
# now you can investigate your data further 

# for example, you can count transactions per day of the week

transactions['tran_day_name'].value_counts() 

In [None]:
# let's check what pd.to_datetime("today") is 

pd.to_datetime("today")

## Missing values 

Missing values are one of the most common challenges you'll encounter when working with data.

To decide how to handle them, you first need to understand why the data is missing in the first place.

When handling missing data, you can: 

- Remove that observation
- Remove the variable
- Replace missing value with mean/median or 0 (numerical variables)
- Replace missing value with mode or extra category called “Unknown” (categorical variables)
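The options above can be sketched on a toy dataframe. The data and column names below are hypothetical and are not taken from the transactions file; which option fits depends on why the values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numerical and a categorical column
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, 20.0],
    "category": ["food", None, "toys", "food"],
})

# Option 1: remove observations (rows) that contain any missing value
rows_dropped = df.dropna()

# Option 2: remove a variable (column) entirely
column_dropped = df.drop(columns=["category"])

# Option 3: fill numerical gaps with the median
# Option 4: fill categorical gaps with an explicit "Unknown" label
filled = df.assign(
    price=df["price"].fillna(df["price"].median()),
    category=df["category"].fillna("Unknown"),
)

print(filled)
```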


In [None]:
# How many missing values do we have in each column?

transactions.isnull().sum()

## Changing Column Types 

Sometimes a column is imported into a pandas dataframe with the wrong data type and you will need to change it.

Often you will want to change integers to strings (if they are codes), strings to integers, integers to floats or floats to integers.

To do that, we use the _astype()_ method.
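A minimal sketch of these conversions, using a made-up dataframe (all column names here are hypothetical). The last step uses pandas' nullable `"Int64"` dtype, because a plain integer column cannot hold missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: a numeric code, prices stored as text,
# and a float column with a missing value
df = pd.DataFrame({
    "postcode": [301, 5003, 7010],
    "price_text": ["10", "25", "40"],
    "quantity": [1.0, np.nan, 3.0],
})

# integer codes -> strings
df["postcode"] = df["postcode"].astype(str)

# strings -> integers
df["price_int"] = df["price_text"].astype(int)

# integers -> floats
df["price_float"] = df["price_int"].astype(float)

# floats -> integers; "Int64" (capital I) keeps the missing value as <NA>
df["quantity"] = df["quantity"].astype("Int64")

print(df.dtypes)
```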


In [None]:
transactions.dtypes # let's check the column types

In [None]:
transactions.price.unique()

In [None]:
# we will turn price from float to int
# note: this truncates any decimals and raises an error if price contains missing values

transactions['price'] = transactions['price'].astype('int')

In [None]:
transactions.dtypes # let's check the column types again

In [None]:
# and we turn price back into float
transactions['price'] = transactions['price'].astype('float')

## Remove duplicates 

What is the primary key here?

- id?

In [None]:
primary_cols = ["id"]

In [None]:
transactions.duplicated(subset=primary_cols).any()

id is just a running number. What might we expect to be unique here?

- userId and logTimestamp?

In [None]:
primary_cols = ["userId", "logTimestamp"]

In [None]:
transactions.duplicated(subset=primary_cols).any()

In [None]:
transactions.loc[transactions.duplicated(subset=primary_cols, keep=False)].sort_values("logTimestamp")

Note the two bottom rows

- userId and logTimestamp are identical, but the products differ
- That may mean that the other rows are not duplicates either
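One way to explore this is to widen the candidate key and see which rows are still flagged. The sketch below uses made-up rows, and `product` is an assumed column name for illustration; the real dataframe may name its product column differently.

```python
import pandas as pd

# Hypothetical rows; "product" is an assumed column name
df = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "logTimestamp": ["2021-03-01 10:00:00"] * 2 + ["2021-03-02 11:00:00"] * 2,
    "product": ["A", "B", "C", "C"],
})

# Narrow key: every row shares userId and logTimestamp with another row
narrow = df.duplicated(subset=["userId", "logTimestamp"], keep=False)

# Wider key: only rows that also share the product are flagged
wide = df.duplicated(subset=["userId", "logTimestamp", "product"], keep=False)

print(narrow.tolist())
print(wide.tolist())

# Dropping on the wider key keeps the first row of each true duplicate pair
deduplicated = df.drop_duplicates(subset=["userId", "logTimestamp", "product"])
```

Rows flagged by the narrow key but not by the wider one are likely separate purchases made at the same moment, so dropping them would lose data.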

In [None]:
# transactions = transactions.drop_duplicates(subset=primary_cols)
# transactions.shape

## Outliers

In statistics, **an outlier** is an observation point that is distant from other observations (different than most observations).

**z-value** or *standard score* is a standard measure of how much out of the ordinary a data point is.

It is defined as:

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the *mean* and $\sigma$ the *standard deviation*,

$$\sigma = \sqrt{\frac 1 {N-1} \sum_{i=1}^N (x_i - \mu)^2}$$
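As a small worked example of the two formulas, the sketch below computes $\mu$, $\sigma$ and $z$ by hand for five made-up prices and checks the result against pandas. Note that `Series.std()` divides by $N-1$ by default, which matches the definition of $\sigma$ given here.

```python
import pandas as pd

# Hypothetical prices; the last one is far from the rest
prices = pd.Series([10.0, 12.0, 11.0, 13.0, 54.0])

# Mean: (10 + 12 + 11 + 13 + 54) / 5 = 20
mu = prices.mean()

# Standard deviation from the formula, dividing by N - 1
manual_sigma = (((prices - mu) ** 2).sum() / (len(prices) - 1)) ** 0.5

# pandas uses the same N - 1 denominator by default (ddof=1)
sigma = prices.std()

# z-value of every observation
z = (prices - mu) / sigma

print(mu, sigma)
print(z.round(2).tolist())
```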

In [None]:
# we can calculate z-value for variable price

transactions['zscore'] = (transactions["price"] - transactions["price"].mean()) / transactions["price"].std()

In [None]:
print(transactions["price"].mean())

print(transactions["price"].std())

In [None]:
# let's try to visualize our zscore

import matplotlib.pyplot as plt 

transactions.plot(y="zscore", figsize = (20, 7))

In [None]:
transactions.boxplot(column="zscore")

In [None]:
transactions[["zscore"]].describe()

In [None]:
# most z-scores are close to 0, but some are much higher

# let's look at the rows where zscore is greater than 3; these are our outlier candidates

outliers = transactions[transactions["zscore"] > 3]
outliers

However, these don't seem to be outliers. They are simply more expensive items.

In [None]:
outliers[["price", "stockName"]].min()

In [None]:
outliers[["price", "stockName"]].max()

## Save data

In [None]:
# and now we save our dataframe to file again 

transactions.to_csv('../data/transactions_dataframe_newest.csv', index=False) 