[View in Colaboratory](https://colab.research.google.com/github/todnewman/coe_training/blob/master/Basic_Data_Engineering_Complex_Files.ipynb)

# Handling Complex Data Files
Author: W. Tod Newman

## Learning Objectives


*   Learn how to identify and import open data files from the UC Irvine archive
*   Learn how to decode a zipped file from UCI
*   Evaluate a complex system of data files and determine how to blend them into coherent features with associated targets

In this exercise we're working with the [Multisensor data fusion data set](https://archive.ics.uci.edu/ml/machine-learning-databases/00366/AReM.zip).  This data is a nice benchmark for activity recognition applications.  The classification task is to predict the activity performed by the user from time-series data generated by a Wireless Sensor Network.  The captured information comes from the implicit alteration of the wireless channel caused by the user's movements.



In [0]:
import pandas as pd
import numpy as np
import urllib

In [0]:
import_file = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00366/AReM.zip'

## Unzip the Data from the UCI site

This can be more challenging than it looks.  The goal is to bring the data into a Pandas DataFrame where we can work on it.  This is made difficult because each activity in the dataset has its own folder of .csv files inside the zip archive, and the target activity has to be inferred from the path of the file the data was captured in (for example, `bending1/dataset1.csv`).
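As a small illustration of inferring the target from the file path, the sketch below pulls the top-level folder out of a zip member name. The helper name `activity_from_path` is hypothetical (it is not part of the notebook), and it assumes every member sits one folder deep, as `bending1/dataset1.csv` does.

```python
from pathlib import PurePosixPath


def activity_from_path(member_name: str) -> str:
    """Return the top-level folder of a zip member name (hypothetical helper)."""
    # Zip member names use forward slashes, so PurePosixPath splits them
    # the same way on every operating system.
    return PurePosixPath(member_name).parts[0]


label = activity_from_path("bending1/dataset1.csv")
print(label)
```

Applied to the `Filename` column, a helper like this would give one activity label per row, which could then serve as the classification target.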

### What do we need to do?


1.   Figure out how to handle unzipping and grabbing each .csv file
2.   Determine where and how to convert the data into a DataFrame
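The two steps above can be sketched without touching the network. The example below builds a tiny stand-in archive in memory, then lists its .csv members and reads each one into a DataFrame. The file contents and the `walking/dataset1.csv` name are made up for illustration; this is not the real AReM data, and the real files need extra handling because of their comment header lines.

```python
import io
import zipfile

import pandas as pd

# Build a tiny stand-in archive in memory (made-up contents, not the AReM data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("bending1/dataset1.csv", "time,avg_rss12\n0,39.25\n250,39.25\n")
    zf.writestr("walking/dataset1.csv", "time,avg_rss12\n0,35.00\n250,36.50\n")
    zf.writestr("readme.txt", "not a csv")

frames = []
with zipfile.ZipFile(buffer) as zf:
    # Step 1: pick out the .csv members of the archive.
    csv_names = [name for name in zf.namelist() if name.endswith(".csv")]
    for name in csv_names:
        # Step 2: read each member straight into a DataFrame and tag its origin.
        with zf.open(name) as source:
            frame = pd.read_csv(source)
        frame["Filename"] = name
        frames.append(frame)

# Stack the per-file frames into one DataFrame.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Collecting the per-file frames in a list and concatenating once at the end avoids growing a DataFrame inside the loop.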



In [0]:
import zipfile, urllib, csv, os, codecs
import urllib.request



def get_items(url):
    frames = []
    # Download the file from `url` to a temporary local file.  urlretrieve
    #   returns the local path and the response headers.
    filename, headers = urllib.request.urlretrieve(url)
    
    # First we grab the names of all the .csv files in the zipped container
    with zipfile.ZipFile(filename) as zf:
        csvfiles = [name for name in zf.namelist()
                    if name.endswith('.csv')]
        
        # For each .csv file, we'll first grab the data, then we'll
        #   add the filename as the Target.
        for item in csvfiles:
            df = pd.DataFrame(grab_data(item, zf)).dropna()
            df['Filename'] = item
            #  Collect the dataframe from each file so we can combine
            #    them into one large dataframe.
            frames.append(df)
    # DataFrame.append was removed in pandas 2.0, so we concatenate instead.
    return pd.concat(frames)
        
def grab_data(item, zf):
    # Here, we actually open the .csv file inside the archive and return its
    #   rows one at a time as a Python generator (the yield statement).  This
    #   then allows us to pull the data into our DataFrame.
    try:
        with zf.open(item) as source:
            reader = csv.DictReader(codecs.getreader('iso-8859-1')(source))             
            for line in reader:
                yield line
    except Exception as error:
        print( "*Error: %s" % str(error) )
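To see what `csv.DictReader` does with files that start with comment lines, the sketch below feeds it a short made-up sample modeled on the output further down in this notebook. The sample text is an assumption for illustration, shortened to three columns; it is not copied from the archive.

```python
import csv
import io

# Made-up stand-in for the first lines of one file (shortened to three columns).
sample = (
    "# Task: bending1\n"
    "# Columns: time,avg_rss12,var_rss12\n"
    "0,39.25,0.43\n"
)

# DictReader treats the first line as the header, so there is only one
# field name.  Extra fields on later lines are stored as a list under
# the default restkey, which is None.
rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    print(row)
```

Each row comes back with its first field under the `# Task: bending1` key and the remaining fields as a list under the `None` key, which is why the combined DataFrame has a column named `None` whose cells are lists.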


## Main Routine

Here, we gather the unzipped data into one master DataFrame.


In [12]:
df = pd.DataFrame()

# Let's start off by unzipping the files and gathering the data into one large DataFrame
df = get_items(import_file)
keys = df.keys()

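# We don't understand much about this data, so let's do a bit of exploration.
# Note: df.info() prints its summary itself and returns None, so print(info) shows "None".
info = df.info()
print(info)

# Let's look at the type of each column.  type() on a column is always a pandas
#   Series; df.dtypes reports the element dtypes instead.
for i in range(len(keys)):
    print("Data Type for %s is %s" % (keys[i], type(df[keys[i]])))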

df_time = df[keys[0]] # Timestamp
df_file = df[keys[2]] # filename

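# This column has only 3367 non-null entries (see df.info()), so we take the first
#   3367 timestamps and repeat them 12 times to line up the time series data.
df_time = pd.concat([df_time[0:3367]]*12, ignore_index=True)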

# Each cell of column 1 holds a Python list of sensor readings (the extra fields
#   that csv.DictReader stored under the None key).  We expand it into separate columns.
data_list = df[keys[1]].tolist()

df_vals = pd.DataFrame(data_list, columns=['avg_rss12','var_rss12','avg_rss13', 'var_rss13', 'avg_rss23', 'var_rss23','name'])

df_file.reset_index(inplace=True, drop=True) # needed to be able to concatenate the two dataframes
df_time.reset_index(inplace=True, drop=True)
df_new = pd.concat([ df_time, df_vals, df_file], axis=1) #concatenate the data and the target
#df_new.to_csv('concat_UCI_multisensor_data_fusion.csv')
df_new

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41847 entries, 3 to 483
Data columns (total 9 columns):
# Task: bending1    3367 non-null object
None                41847 non-null object
Filename            41847 non-null object
# Task: bending2    2406 non-null object
# Task: cycling     7215 non-null object
# Task: lying       7215 non-null object
# Task: sitting     7214 non-null object
# Task: standing    7215 non-null object
# Task: walking     7215 non-null object
dtypes: object(9)
memory usage: 3.2+ MB
None
Data Type for # Task: bending1 is <class 'pandas.core.series.Series'>
Data Type for None is <class 'pandas.core.series.Series'>
Data Type for Filename is <class 'pandas.core.series.Series'>
Data Type for # Task: bending2 is <class 'pandas.core.series.Series'>
Data Type for # Task: cycling is <class 'pandas.core.series.Series'>
Data Type for # Task: lying is <class 'pandas.core.series.Series'>
Data Type for # Task: sitting is <class 'pandas.core.series.Series'>
Data Type for

| | # Task: bending1 | avg_rss12 | var_rss12 | avg_rss13 | var_rss13 | avg_rss23 | var_rss23 | name | Filename |
|---|---|---|---|---|---|---|---|---|---|
| 0 | # Columns: time | avg_rss12 | var_rss12 | avg_rss13 | var_rss13 | avg_rss23 | var_rss23 | | bending1/dataset1.csv |
| 1 | 0 | 39.25 | 0.43 | 22.75 | 0.43 | 33.75 | 1.30 | | bending1/dataset1.csv |
| 2 | 250 | 39.25 | 0.43 | 23.00 | 0.00 | 33.00 | 0.00 | | bending1/dataset1.csv |
| 3 | 500 | 39.25 | 0.43 | 23.25 | 0.43 | 33.00 | 0.00 | | bending1/dataset1.csv |
| 4 | 750 | 39.50 | 0.50 | 23.00 | 0.71 | 33.00 | 0.00 | | bending1/dataset1.csv |
| 5 | 1000 | 39.50 | 0.50 | 24.00 | 0.00 | 33.00 | 0.00 | | bending1/dataset1.csv |
| 6 | 1250 | 39.25 | 0.43 | 24.00 | 0.00 | 33.00 | 0.00 | | bending1/dataset1.csv |
| 7 | 1500 | 39.25 | 0.43 | 24.00 | 0.00 | 33.00 | 0.00 | | bending1/dataset1.csv |
| 8 | 1750 | 39.00 | 0.00 | 23.75 | 0.43 | 33.00 | 0.00 | | bending1/dataset1.csv |
| 9 | 2000 | 39.50 | 0.50 | 24.00 | 0.00 | 33.00 | 0.00 | | bending1/dataset1.csv |
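The first row of the result is the file's own `# Columns:` header line, and every value is stored as a string. One possible alternative, shown here only as a sketch and not as the notebook's method, is to let `pd.read_csv` skip the `#` lines and assign the column names directly. The sample text below is a stand-in assembled from the rows displayed above; the real files may need further handling that this example does not cover.

```python
import io

import pandas as pd

# Column names taken from the "# Columns:" line shown in the output above.
columns = ["time", "avg_rss12", "var_rss12", "avg_rss13",
           "var_rss13", "avg_rss23", "var_rss23"]

# Stand-in for one file, assembled from rows displayed in the notebook output.
sample = (
    "# Task: bending1\n"
    "# Columns: time,avg_rss12,var_rss12,avg_rss13,var_rss13,avg_rss23,var_rss23\n"
    "0,39.25,0.43,22.75,0.43,33.75,1.30\n"
    "250,39.25,0.43,23.00,0.00,33.00,0.00\n"
    "500,39.25,0.43,23.25,0.43,33.00,0.00\n"
)

# comment="#" drops the header comment lines; header=None plus names=
# supplies the column names ourselves.
frame = pd.read_csv(io.StringIO(sample), comment="#", header=None, names=columns)
frame["Filename"] = "bending1/dataset1.csv"
print(frame.dtypes)
```

Read this way, the sensor columns arrive as numeric dtypes rather than strings, so no separate list-expansion step is needed.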
