# Mobility Data Analysis
1. Scoping
2. Preprocess the data
3. Exploratory Data Analysis (EDA)
4. Generate user trips and other user attributes
5. Perform analysis of user data to generate individual-level metrics
6. Generate aggregate metrics such as OD

# Project Scoping

# Instructions
1. Please fill out all the places with "YOUR CODE HERE" with your code.
2. Fill in with your functions, in places where I ask you to do so.
3. Please answer all non-code questions where 

#  Python setup
Here, we import all the required Python packages. To use
any other module which wasn't ```pip``` installed, such as ```mob_data_utils```,
you can do the following, where the path points to the folder that contains the module:
```sys.path.append(full_path_to_module)```
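
As a minimal sketch of the import-path step above: the folder name `src` is a hypothetical placeholder for wherever `mob_data_utils.py` actually lives, so adjust it to your own layout.

```python
import sys
from pathlib import Path

# Hypothetical folder that contains mob_data_utils.py (an assumption)
module_dir = Path.cwd() / "src"

# sys.path holds strings, so convert the Path before appending
if str(module_dir) not in sys.path:
    sys.path.append(str(module_dir))

# After this, `import mob_data_utils as ut` would work if the file is there
```

The membership check keeps the path from being appended again each time the cell is re-run.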

In [1]:
# utility libraries
import os
from pathlib import Path
from functools import wraps
import time
from datetime import datetime

# data processing libraries
import pandas as pd
import numpy as np

# Apache Spark Modules
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql.functions import col,udf
from pyspark.sql.types import *

# plotting library
import seaborn as sns
sns.set_style("white")
sns.set_context("poster", font_scale=1.25, rc={"lines.linewidth":1.25, "lines.markersize":8})

# local libraries (e.g., mob_data_utils)
# since mob_data_utils.py is in this dir
import mob_data_utils as ut

# Setup working directories
It's also important to set up commonly used directories, such as where you will be saving data.

In [3]:
# We can use CAPS for these variables since they are constants
BASE_DIR = Path.cwd().parent
DATA_DIR = BASE_DIR.joinpath('data')
OUTPUTS_DIR = BASE_DIR.joinpath('outputs')

In [1]:
# Setup global parameters and variables
MISC_PROCESSING_PARAMS = {'distance_threshold': 2, 'min_unique_locs': 2,'datetime_col': 'datetime',
                        'userid': 'user_id','x': 'lon', 'y': 'lat'}

# Data preprocessing.
Often, after all the data has been acquired, the next step is to do some preprocessing on the raw data.
The objectives of this task will vary depending on the data analysis goals, but some of them include the following:
- **Sanitize the data:** This data cleaning has to be done carefully to avoid introducing errors, but it's often a necessary step. It can involve dropping unnecessary variables/columns, renaming columns to something which makes more sense, and dropping some observations. For instance, in this analysis, where location and time-stamp are important, dropping all observations with no time-stamp or no location is required.
- **Create new variables:** If necessary, this is also the time to transform some variables from a format which is not convenient for your analysis. For instance, converting string time variables to datetime-aware variables.
- **Combine datasets:** If you have more than one dataset, you can also combine several datasets into one during preprocessing. For instance, we have the CDR transactions, which have no location details. We bring in the location details from another file.
- **Filtering based on columns and observations:** This can be done at any of the stages mentioned above, but it's worth mentioning that you may often drop some columns which aren't useful for your analysis. Also, you may drop some observations based on some conditions, depending on your analysis needs.
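
The steps above can be sketched on a tiny pandas example. The column names (`user_id`, `cell_id`, `timestamp`, `unused_col`), the toy values, and the lookup table are assumptions made only for illustration; the project's actual preprocessing is done with Spark.

```python
import pandas as pd

# Toy CDR records (hypothetical columns and values); timestamps are strings
cdrs = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2"],
    "cell_id": ["c1", "c2", "c1", None],
    "timestamp": ["20180701083000", "20180701174500", None, "20180702120000"],
    "unused_col": [0, 0, 0, 0],
})

# Toy lookup table with the coordinates of each cell (hypothetical)
locs = pd.DataFrame({
    "cell_id": ["c1", "c2"],
    "lon": [33.78, 33.80],
    "lat": [-13.96, -13.98],
})

# Sanitize: drop a column we do not need and rows missing time or location
clean = cdrs.drop(columns=["unused_col"]).dropna(subset=["timestamp", "cell_id"])

# Create new variables: parse the string timestamp into a datetime column
clean = clean.assign(
    datetime=pd.to_datetime(clean["timestamp"], format="%Y%m%d%H%M%S")
)

# Combine datasets: bring in lon/lat from the location table
merged = clean.merge(locs, on="cell_id", how="inner")

print(merged[["user_id", "datetime", "lon", "lat"]])
```

Only the two complete records survive, each now carrying a parsed datetime and its coordinates.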

Unlike in other data collection domains such as surveys, where you can have standard data processing steps, in the data science space, where your dataset can be anything, there are no hard and fast rules for preprocessing and data cleaning. It is handled case by case, depending on your analysis goals. Also, preprocessing isn't necessarily a linear process: depending on what results you get downstream, you can go back and modify the preprocessing steps. In this project, we have the ```preprocess_cdrs_using_spark``` function, which takes raw CDRs and saves a processed dataset to a CSV file. Alternatively, it can return a Spark DataFrame.

In [None]:
def rename_sdf(df, mapper=None):
    ''' Rename columns of a Spark DataFrame
        mapper: a dict mapping from the old column names to new names
        Usage:
            rename_sdf(df, {'old_col_name': 'new_col_name', 'old_col_name2': 'new_col_name2'})
    '''
    # nothing to rename if no mapper is given
    if not mapper:
        return df

    # withColumnRenamed returns a new DataFrame, so reassign on each step
    for before, after in mapper.items():
        df = df.withColumnRenamed(before, after)

    return df

In [None]:
# ADD YOUR PROCESS FUNCTION HERE

In [None]:
# Use DATA_DIR and joinpath, as used above, to create the
# full paths for the simulated_cdrs folder and the loc file
loc_file = YOUR CODE
cdrs_dir = YOUR CODE
num_users = YOUR CODE
debug = True

# call preprocess_cdrs_using_spark here
# chain cache() onto the end of the call, like this: preprocess_cdrs_using_spark(...).cache()
# Learn what cache() does in Spark here:
# https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.cache.html#pyspark.sql.DataFrame.cache
dfu = preprocess_cdrs_using_spark(file_or_folder=str(cdrs_dir), number_of_users_to_sample=num_users,
                                date_format='%Y%m%d%H%M%S',debug_mode=False, 
                                  loc_file=loc_file, save_to_csv=False).cache()

# Exploratory Data Analysis (EDA)
Whether the end result of your project is to produce a statistical report or 
to build a prediction model to be put in production, EDA is an essential stage in any data science project. EDA can be defined as 
the process of performing initial investigations on data so as to discover patterns, to spot anomalies,
to test hypotheses and to check assumptions with the help of summary statistics and graphical representations.
It is good practice to understand the data first and try to gather as many insights from it as possible. 
EDA is all about making sense of data before using it for its intended purpose (e.g., building ML models, performing statistical analysis). 

Again, there aren't hard and fast rules on how to perform EDA, but some of the specific questions you would like to answer are as follows:
- For each variable in the data, what's its distribution? Is it skewed? What's its data type? Is it an appropriate data type for my analysis? Are there any outliers?
- What's the relationship between variables?
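
The questions above can be illustrated with a short pandas sketch. The two columns and their values are made-up toy data, and the 1.5 * IQR rule is just one common convention for flagging outliers, not something prescribed by this project.

```python
import pandas as pd

# Toy data (hypothetical values): events per user and active days
df = pd.DataFrame({
    "num_events": [3, 5, 8, 12, 15, 20, 400],
    "active_days": [1, 2, 3, 5, 6, 8, 30],
})

# Data types and summary statistics for each variable
print(df.dtypes)
print(df.describe())

# Skewness: values well above 0 indicate a long right tail
skew = df["num_events"].skew()

# Flag outliers with the 1.5 * IQR rule
q1, q3 = df["num_events"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["num_events"] < q1 - 1.5 * iqr) | (df["num_events"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

# Relationship between variables: pairwise Pearson correlation
corr = df.corr()
print(skew, outliers, corr, sep="\n")
```

Here the single user with 400 events is flagged as an outlier and pulls the distribution to the right.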

In this project, we will use the ```explore_data``` function and explore what functions Spark has for basic EDA.

In [None]:
# ADD YOUR EXPLORE DATA FUNC HERE

## Project Task

### Generate more summary statistics
Please complete the function below.

In [4]:
def summary_stats_for_user_events(spark_df, out_stats):
    """
    In this function, the goal is to take a big Spark
    DataFrame, group by user and count each user's events, 
    convert to a pandas DataFrame and generate summary stats
    :param: spark_df: preprocessed Spark DataFrame with data for multiple users
    :param: out_stats: CSV file path to save the summary stats
    """
    # group user and count number of events
    # convert resulting spark dataframe to pandas
    pdf = YOUR CODE
    # change column "count" to num_events,remember that pdf is a pandas DataFrame
    YOUR CODE
    
    # generate summary stats using pandas describe() function
    # use property T to transpose the describe results and convert them
    # into a DataFrame like this: pd.DataFrame(transposed describe results).reset_index()
    # Don't forget to call reset_index()
    pdf_sum_stats = YOUR CODE
    
    # remove the first row which has value "count"
    # you can use list indexing to achieve this
    pdf_sum_stats = YOUR CODE
    
    # Rename the column index into something informative. For instance, "Stat"
    YOUR CODE
    
    # Rename the numeric percentile labels to something more readable
    # first, declare a dict with old and new names
    # next, update the Stat column using the pd.Series.map() function
    YOUR CODE
    YOUR CODE
    
    # show the summary stats before saving them
    print(pdf_sum_stats)
    # Now  save the summary stats to CSV
    YOUR CODE
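
As a hedged illustration of the pandas part of this workflow, the sketch below runs on toy counts rather than on the project data. The values are invented, and it calls `describe()` on a single column instead of transposing, so treat it as one possible approach and not as the expected solution.

```python
import pandas as pd

# Toy per-user event counts (hypothetical values), as they might look
# after the grouped Spark result has been converted to pandas
pdf = pd.DataFrame({"user_id": ["a", "b", "c", "d"], "count": [3, 10, 25, 120]})
pdf = pdf.rename(columns={"count": "num_events"})

# describe() on one column returns a Series; turn it into a two-column table
stats = pdf["num_events"].describe().to_frame().reset_index()

# Drop the first row ("count") and give the index column a clearer name
stats = stats.iloc[1:].rename(columns={"index": "Stat"})

# Map percentile labels to friendlier names, leaving other labels unchanged
new_names = {"25%": "25th percentile", "50%": "median", "75%": "75th percentile"}
stats["Stat"] = stats["Stat"].map(lambda s: new_names.get(s, s))

print(stats)
```

Passing the dict directly to `map()` would turn unmapped labels such as `mean` into missing values, which is why the sketch uses a function with a fallback.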

In [None]:
# CALL YOUR summary_stats_for_user_events FUNCTION HERE, 
# REMEMBER THAT IT REQUIRES THE PREPROCESSED DF FROM 
# preprocess_cdrs_using_spark

###  Interpreting EDA results and applying them in downstream work
Please answer the question below. You should answer in this same notebook.

**EDA-Question-1**: Given the distribution of the number of events per user, when it comes to characterizing user mobility patterns such as number of trips, do you think we should use all users regardless of their number of events?

A. Yes, we can use all users regardless of number of events.

B. No, we should filter out some users

**Your answer:**

**EDA-Question-2**: If you answered B, please answer the following questions on filtering:
1. How do you determine the threshold for filtering?
2. Which users do you filter out? 

# Generate individual based mobility patterns and attributes
As we noted in the project instructions, the focus of this analysis is to understand the mobility patterns 
of individual users. Although generating trips and understanding their distribution is crucial for this project, due to time constraints, we will start with simple mobility metrics, namely:
- **Radius of gyration (Rg):** For a single day, Rg can be defined in simple terms as the maximum distance a user travels. We can then compute ```avg_Rg``` by averaging the daily Rg values in the user's data. This metric, ```avg_Rg```, is what we will compute.
- **Number of unique locations visited every day:** As the name suggests, this is simply the count of unique locations an individual visits each day. Given multiple days of data, we will compute ```avg_locs_per_day```.
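
To make the two metrics concrete, here is a small self-contained sketch that follows the simplified definition above (the largest distance between any two locations visited in a day). It does not use `mob_data_utils`; the helpers `haversine_km` and `daily_metrics`, the column names, and the toy coordinates are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance in km between two lon/lat points in degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

def daily_metrics(day_df):
    """Return (max pairwise distance in km, number of unique locations)."""
    pts = day_df[["lon", "lat"]].drop_duplicates().to_numpy()
    max_dist = 0.0
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            max_dist = max(max_dist, haversine_km(*pts[i], *pts[j]))
    return max_dist, len(pts)

# Toy single-user data (hypothetical values): two days of observations
df = pd.DataFrame({
    "date": ["2018-07-01"] * 3 + ["2018-07-02"],
    "lon": [0.0, 1.0, 0.0, 0.0],
    "lat": [0.0, 0.0, 0.0, 0.0],
})

# One (distance, unique locations) pair per day, then average over days
per_day = {date: daily_metrics(g) for date, g in df.groupby("date")}
avg_rg = np.mean([m[0] for m in per_day.values()])
avg_locs_per_day = np.mean([m[1] for m in per_day.values()])
num_days = len(per_day)

print(avg_rg, avg_locs_per_day, num_days)
```

A day with a single location contributes a distance of 0, which is how the sketch treats days where the user does not appear to move.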

In addition to the mobility metrics above, we will report the ```number of days``` a user was active, which will help us understand how much we should trust the user's data. 

For this task, we will use functions in the ```mob_data_utils``` module, which were already created to generate the required metrics above. You can import ```mob_data_utils``` like this to use my code: 
```import mob_data_utils as ut```

## Define functions to generate user attributes

In [None]:
def get_basic_user_mob_attributes(df):
        """
        In this function, we generate some basic user attributes 
        to help further explore the data and also report on 
        individual mobility metrics.
        :param df: Pandas DataFrame of single user data
        :return: dates_dist, unique_locs_by_day, number of days
        """
        # get datetime col, x (lon), y (lat) from the MISC_PROCESSING_PARAMS variable
        datetimecol = YOUR CODE
        x =  YOUR CODE
        y =  YOUR CODE
        
        # use an if condition to check whether the date column exists
        if  'date' not in df.columns:
            # add a date column in case it's not there
            # use the datetimecol to achieve this
            df['date'] = YOUR CODE
        
        # get a list of all  days/dates in ascending order
        # first, sort the dates and then get only unique dates
        dates = YOUR CODE
        # this dictionary will keep, for each date, a count of unique locations visited
        # initialize this dict with dates as keys and values set to 0
        # you can use the list comprehension idea, though this is a dictionary
        # Hint: create a list of unique dates when initializing this dict
        unique_locs_by_day = YOUR CODE
        
        # create a dictionary just like above but this one will keep
        # maximum distance travelled for each day
        dates_dist = YOUR CODE 
        
        # Loop through the dates_dist dictionary
        YOUR CODE 
            # Filter the input df so that we only get data for this date
            dfd = YOUR CODE 

            # get number of unique locations for this day
            uniq_xy = ut.va_generate_unique_locs(df=dfd, x=x, y=y)
            # update the unique_locs_by_day dict: the key is this date
            # (as named in your loop) and the value is the number of
            # unique locations visited
            YOUR CODE


            # distances travelled
            if len(uniq_xy) > 1:
                dist_mtx = ut.va_distance_matrix(uniq_xy)
                # From the distance matrix above, get only columns with "to"
                # in the name; use a list comprehension with an if condition
                req_cols = YOUR CODE
                # get max value from the distance matrix above
                # first, subset the dist_mtx DataFrame by selecting only req_cols
                # then, get values from the resulting DataFrame which you should
                # pass into the np.max() function
                # put the resulting max value into the dates_dist dict with this 
                # date as key
                # this can be achieved in a single line of code or multiple lines
                YOUR CODE
            else:
                # if number of unique locations is less than or equal to 1
                # then set the value in the dates_dist dict accordingly
                YOUR CODE
        
        # return dates_dist, unique_locs_by_day, number of days
        YOUR CODE

In [None]:
def generate_basic_user_attributes_with_pandas(df, outcsv, num_events_threshold=None):
    """
    In this function, we loop through all users, generate basic
    mobility attributes for each of them and save the results to file.
    :param df: Pandas DataFrame with multiple user data
    :param outcsv: CSV file path to save the user attributes
    :param num_events_threshold: if set, users with fewer events are skipped
    :return:
    """
    # get userid col name from  MISC_PROCESSING_PARAMS
    userid = YOUR CODE
    # generate a list of unique userid's
    user_list = YOUR CODE
    # initialize an empty list to hold user  data
    user_data = YOUR CODE
    
    # Loop through all users and generate their attributes
    YOUR CODE
        # Filter the input df so that we only get data for this user
        df_user = YOUR CODE
        
        if num_events_threshold:
            if df_user.shape[0] < num_events_threshold:
                continue
            else:
                # call the get_basic_user_mob_attributes function here
                YOUR CODE
        else:
            # call the get_basic_user_mob_attributes function here
            YOUR CODE
        
        # get the attributes
        # create a dictionary with the following keys: 'userid', 'usage_days', 'mean_locs_day'
        # use appropriate numpy functions to compute  mean as required and set them as values
        # in the dict
        user_att = YOUR CODE
        # add user_att to the user_data list
        YOUR CODE
    
    # create DataFrame using user_data and save it to file (2 lines of code)
    YOUR CODE

In [None]:
# CALL YOUR generate_basic_user_attributes_with_pandas() FUNCTION HERE, 
# FIRST, CONVERT THE DF FROM preprocess_cdrs_using_spark() TO PANDAS

## Create outputs from the CSV file of user attributes

### QUESTION-PART1: DEFINE A FUNCTION WITH FOLLOWING PROPERTIES:
- **Function name:** generate_outputs_from_csv()
- **Inputs:** CSV file which you saved from the generate_basic_user_attributes_with_pandas function above
- **Inside the function:** use any plotting library (e.g., seaborn as we have used in this course) to generate a distribution plot of ```Radius of gyration(Rg)```.  Make sure your function shows the plot inline when it runs.
- **Function output:** The function doesn't have to return anything

### QUESTION-PART2: INTERPRET THE RESULTS
Write a few sentences to interpret the plot of ```Radius of gyration(Rg)``` that you generated. Based on how it looks, do you think it's normally distributed? 

### QUESTION: EXPLORE RELATIONSHIP BETWEEN ```avg_Rg``` and ```avg_locs_per_day```
Using the CSV that you saved above, perform the analysis (you don't have to use a function)
- Use a relevant plot to show relationship between the two variables
- Report the correlation coefficient between the two variables
- Write a few sentences to interpret the results
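
As a rough sketch of the correlation step, the snippet below computes the Pearson coefficient with pandas. The DataFrame is made-up toy data standing in for the saved CSV, and the column names are assumed to match `avg_Rg` and `avg_locs_per_day`; in the notebook you would load your own file with `pd.read_csv` instead.

```python
import pandas as pd

# Toy user attributes (hypothetical values) standing in for the saved CSV
attrs = pd.DataFrame({
    "avg_Rg": [0.5, 1.2, 2.0, 3.5, 5.1, 8.0],
    "avg_locs_per_day": [1.0, 1.5, 2.0, 2.5, 3.5, 4.0],
})

# Pearson correlation coefficient between the two metrics
r = attrs["avg_Rg"].corr(attrs["avg_locs_per_day"])
print(f"Pearson r = {r:.3f}")

# A scatter plot is a natural companion, for example with seaborn:
# sns.scatterplot(data=attrs, x="avg_Rg", y="avg_locs_per_day")
```

A coefficient close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or no linear relationship.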