<a href="https://colab.research.google.com/github/christophermalone/DSCI325/blob/main/Module4_Part2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 4 - Part II:  Streaming Data + Python Function Writing

This module covers function writing in Python.  In addition, the data we will be working with is large, so a streaming approach to summarizing the data is used in this notebook.

<table width='100%' ><tr><td bgcolor='green'></td></tr></table>

### Example 4.2.P
The Centers for Medicare & Medicaid Services (CMS) publishes complete datasets that consolidate the payment information submitted by reporting entities -- this is known as the Open Payments data. 

The following 13 fields will be considered here:

*   Physician_Profile_ID: Unique ID for each physician 
*   Physician_FirstName: First name of physician
*   Physician_LastName: Last name of physician
*   Recipient_Address: Address of payment recipient
*   Recipient_City: City of payment recipient
*   Recipient_State: State of payment recipient
*   Recipient_Zip_Code: Zipcode of payment recipient
*   Physician_Primary_Type: type of physician (NY and MN have variations in coding for this field)
*   Applicable_GPO_Name: Name of Group Purchasing Organization
*   Applicable_GPO_ID: Unique ID for Group Purchasing Organization
*   Number_of_Payments: Number of reported payments
*   Payment: Amount of payment
*   Date_of_Payment: Date of payment

Source:  https://www.cms.gov/OpenPayments/Data/Dataset-Downloads

<table width='100%' ><tr><td bgcolor='green'></td></tr></table>



---



## Tidy up Directories

The **ls** bash command can be used to get file information for a folder in Colab. 

In [None]:
!ls /content/sample_data/ -l

total 55504
-rwxr-xr-x 1 root root     1697 Jan  1  2000 anscombe.json
-rw-r--r-- 1 root root   301141 Mar 23 14:22 california_housing_test.csv
-rw-r--r-- 1 root root  1706430 Mar 23 14:22 california_housing_train.csv
-rw-r--r-- 1 root root 18289443 Mar 23 14:22 mnist_test.csv
-rw-r--r-- 1 root root 36523880 Mar 23 14:22 mnist_train_small.csv
-rwxr-xr-x 1 root root      930 Jan  1  2000 README.md


The **rm** bash command can be used to remove files.  Here, only *.csv files are being removed.

In [None]:
!rm /content/sample_data/*.csv

Notice that all the *.csv files have been deleted from the /content/sample_data/ folder.

In [None]:
!ls /content/sample_data/ -l

total 8
-rwxr-xr-x 1 root root 1697 Jan  1  2000 anscombe.json
-rwxr-xr-x 1 root root  930 Jan  1  2000 README.md


The following variation will remove **ALL** files whose names contain a period, which here is every remaining file -- be careful when using this command!

In [None]:
!rm /content/sample_data/*.*

Notice, all files have been removed in this folder.

In [None]:
!ls /content/sample_data/ -l

total 0
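
The difference between the *.csv and *.* patterns can be previewed from Python before anything is deleted. The following is a minimal sketch using the standard-library `fnmatch` module; the list of names is illustrative only, and `LICENSE` is a hypothetical file added to show a name that *.* does not match.

```python
from fnmatch import fnmatch

# Illustrative file names; LICENSE is a hypothetical file with no period in its name
names = ["anscombe.json", "README.md", "mnist_test.csv", "LICENSE"]

# Names matched by each wildcard pattern
csv_matches = [n for n in names if fnmatch(n, "*.csv")]
dot_matches = [n for n in names if fnmatch(n, "*.*")]

print(csv_matches)
print(dot_matches)
```

Only names containing a period match *.*, so a file such as the hypothetical `LICENSE` would survive that command.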


## Load Data

The next step is to upload the data for this notebook.  This is a large data file, so the data is provided as a zipped file.  In a workplace environment, you will likely pull this file from a local server.  The following cell creates a small script containing the command that will retrieve the data file from a server and place it into your Colab session.

In [None]:
%%bash
# Create a file that contains the command to download the data file // location from StatsClass server
# Note: the %%bash cell magic must be the first line of the cell
{  
 echo 'wget -O  /content/sample_data/MedicarePayments_Physicians_Stream_Statsclass.zip https://www.statsclass.org/dsci325/datasets/MedicarePayments_Physicians_Stream.zip'
} > Download_MedicarePayments_Stream_StatsClass.sh

The following bash command will automatically download the file to the specified location.

In [None]:
# The bash commands have been commented out as I do *not* want to crash the StatsClass server
#%%bash
#bash Download_MedicarePayments_Stream_StatsClass.sh

Use the **ls** bash command again to see whether the file was successfully downloaded from the StatsClass server.  

In [None]:
!ls /content/sample_data/ -l

total 253088
-rw-r--r-- 1 root root 129578205 Apr  6 23:27 MedicarePayments_Physicians_Stream_Statsclass.zip
-rw-r--r-- 1 root root 129578205 Apr  7 15:29 MedicarePayments_Physicians_Stream.zip


**Comments**:  

*    For this session, I would like you to download the MedicarePayments_Physicians_Stream.zip file **directly from <a href="https://mnscu-my.sharepoint.com/:f:/g/personal/aq7839yd_minnstate_edu/EuupFPUefSRKux_7dtftWJ4BOToHKoC_JE8xysYtj3Epbw?e=FvoEQ1">OneDrive</a>**.  After downloading the file, upload it into the sample_data folder within this Colab session.

*    I was not able to download the data directly from OneDrive or Google Drive into this Colab session.

## Unzip the Data

The **unzip** bash command can be used to unzip the desired data file.  The desired file is large, so unzipping this file may take some time.

In [None]:
!unzip -o /content/sample_data/MedicarePayments_Physicians_Stream.zip -d "/content/sample_data"

Archive:  /content/sample_data/MedicarePayments_Physicians_Stream.zip
  inflating: /content/sample_data/MedicarePayments_Physicians_Stream.csv  


Check whether the contents of the zipped file were successfully unpacked.  When unzipped, the file is about 6 times larger.

In [None]:
!ls /content/sample_data/ -l

total 872368
-rw-r--r-- 1 root root 763717749 Apr  6 18:15 MedicarePayments_Physicians_Stream.csv
-rw-r--r-- 1 root root 129578205 Apr  7 15:29 MedicarePayments_Physicians_Stream.zip


Recall, the **wc** bash command with the **-l** option counts the number of lines, i.e. records, in this large file.  About 5.7 million records are contained in this file.

In [None]:
!wc -l /content/sample_data/MedicarePayments_Physicians_Stream.csv

5721010 /content/sample_data/MedicarePayments_Physicians_Stream.csv
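
The same line count can be obtained in Python without loading the whole file. The following is a minimal sketch; `count_lines` is a hypothetical helper name, and a small temporary file stands in for the large CSV so the example runs anywhere.

```python
import os
import tempfile

def count_lines(path):
    """Count lines by iterating over the file, holding one line in memory at a time."""
    with open(path, "r") as file:
        return sum(1 for _ in file)

# Small temporary file standing in for the large CSV (assumption for illustration)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("1,a\n2,b\n3,c\n")
    sample_path = tmp.name

n_lines = count_lines(sample_path)
print(n_lines)

# Clean up the temporary file
os.remove(sample_path)
```

Because the file object is consumed one line at a time, memory use stays small regardless of file size.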


Taking a look at the first few records using the **head** command.

In [None]:
!head /content/sample_data/MedicarePayments_Physicians_Stream.csv

557946,Vikas,Pilly,400 International Drive,Williamsville,NY,14221-5771,Medical,Caerus Corp.,100000246826,1,14875,08/19/2020
276936,Matthew,Hall,301 Prospect Avenue,Syracuse,NY,13203,Doctor of Dentistry,Solvay Dental 360-A Div of Solvay Specialty Polymers USA LLC,100000151564,1,5296,06/16/2020
1275463,Stephen,Campbell,228 Ravine Road,Hinsdale,IL,60521,Doctor of Dentistry,Solvay Dental 360-A Div of Solvay Specialty Polymers USA LLC,100000151564,1,1050,08/02/2020
268352,Leroy,McCarty,8100 W. 78th Street,Edina,MN,55439-2516,Medical,Caerus Corp.,100000246826,1,4750,03/19/2020
904225,Michael,Latza,212 Park Avenue,Plainfield,NJ,07060-1206,Chiropractor,Caerus Corp.,100000246826,1,320,10/13/2020
995636,MEREDITH,CARBONE-DOYLE,734 N MAIN ST,LACONIA,NH,03247,Doctor of Osteopathy,Mission Pharmacal Company,100000000186,1,14.31,02/20/2020
324692,JOSEPH,BEALS,390 PARK ST,BIRMINGHAM,MI,48009,Medical Doctor,Mission Pharmacal Company,100000000186,1,16.47,01/09/2020
259992,ANDREA,BAYER,10115 FOREST

Checking the last few lines of this large data file using the **tail** bash command.

In [None]:
!tail /content/sample_data/MedicarePayments_Physicians_Stream.csv

1209319,ANDERS,CARLSON,401 PHALEN BLVD,SAINT PAUL,MN,55130,Medical,Abbott Laboratories,100000010680,1,16009,09/01/2020
1209319,ANDERS,CARLSON,401 PHALEN BLVD,SAINT PAUL,MN,55130,Medical,Abbott Laboratories,100000010680,1,9591,08/01/2020
1209319,ANDERS,CARLSON,401 PHALEN BLVD,SAINT PAUL,MN,55130,Medical,Abbott Laboratories,100000010680,1,1958,07/01/2020
1209319,ANDERS,CARLSON,401 PHALEN BLVD,SAINT PAUL,MN,55130,Medical,Abbott Laboratories,100000010680,1,326,05/01/2020
1209319,ANDERS,CARLSON,401 PHALEN BLVD,SAINT PAUL,MN,55130,Medical,Abbott Laboratories,100000010680,1,3264,11/01/2020
1209319,ANDERS,CARLSON,401 PHALEN BLVD,SAINT PAUL,MN,55130,Medical,Abbott Laboratories,100000010680,1,2448,12/01/2020
1209319,ANDERS,CARLSON,401 PHALEN BLVD,SAINT PAUL,MN,55130,Medical,Abbott Laboratories,100000010680,1,4896,10/01/2020
75907,JACKIE,FRENCH,223 E 34th St,New York,NY,10016-4852,Medical,Biogen Inc.,100000000193,1,831.25,05/11/2020
829368,BRAD,VAUGHN,101 Manning Dr Fl 1,Chapel Hill,NC,27

# Data Processing via Streaming

## Reading Data in Line-by-Line

Before reading in data, you must give Python access to the file.  This notebook uses the **open()** function to accomplish this task.  The open() function returns a file object, which includes methods that make it easier to read in data, etc. 

In addition, this notebook takes advantage of the **with** statement, which has simpler syntax and handles exceptions/errors better: the file is closed automatically when the block ends. 

<table width='75%'>
  <tr>
    <td width='50%' valign="top">
      <font size="+1">
      <p align='center'><strong>Traditional</strong></p>
      file = open("mydata.txt")<br>
      data = file.read()<br>
      print(data)<br>
      file.close()</font> # It's important to close the file<br>
    </td>
    <td width='50%' valign="top">
      <font size="+1">
      <p align='center'><strong>Using with</strong></p>
      with open("mydata.txt") as file:<br>
        &nbsp;&nbsp;&nbsp;data = file.read()<br>
       </font>
       <br><br>
    </td>
  </tr>
</table>
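
To make the comparison concrete, the following minimal sketch runs both approaches against a small temporary file that stands in for "mydata.txt" (an assumption for illustration). The traditional version is wrapped in try/finally, which is roughly what the with statement does on your behalf.

```python
import os
import tempfile

# Temporary file standing in for "mydata.txt" (assumption for illustration)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello\n")
    path = tmp.name

# Traditional approach, made safe with try/finally
file = open(path)
try:
    data_traditional = file.read()
finally:
    file.close()  # runs even if read() raises an error

# Using with: the file is closed automatically when the block exits
with open(path) as file:
    data_with = file.read()

print(data_traditional == data_with, file.closed)

# Clean up the temporary file
os.remove(path)
```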

Consider the following use of the open() function along with the with statement.  This code considers one line at a time -- which is necessary when data processing is done via streaming.  The **if** statement is needed so that printing happens only for the first five lines; the loop itself still passes over every line in the file.

In [16]:
# Reading in data one line at a time - showing the first 5 lines
with open("/content/sample_data/MedicarePayments_Physicians_Stream.csv", "r") as file:

    for index, line in enumerate(file):
       if index < 5:
         print(index,line)

0 557946,Vikas,Pilly,400 International Drive,Williamsville,NY,14221-5771,Medical,Caerus Corp.,100000246826,1,14875,08/19/2020

1 276936,Matthew,Hall,301 Prospect Avenue,Syracuse,NY,13203,Doctor of Dentistry,Solvay Dental 360-A Div of Solvay Specialty Polymers USA LLC,100000151564,1,5296,06/16/2020

2 1275463,Stephen,Campbell,228 Ravine Road,Hinsdale,IL,60521,Doctor of Dentistry,Solvay Dental 360-A Div of Solvay Specialty Polymers USA LLC,100000151564,1,1050,08/02/2020

3 268352,Leroy,McCarty,8100 W. 78th Street,Edina,MN,55439-2516,Medical,Caerus Corp.,100000246826,1,4750,03/19/2020

4 904225,Michael,Latza,212 Park Avenue,Plainfield,NJ,07060-1206,Chiropractor,Caerus Corp.,100000246826,1,320,10/13/2020
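
When only the first few lines are needed, the loop can stop early instead of passing over the remaining lines. The following is a minimal sketch using `itertools.islice` from the standard library; an in-memory `io.StringIO` object with made-up rows stands in for the large file.

```python
import io
from itertools import islice

# In-memory stand-in for the large file (assumption for illustration)
sample = io.StringIO("".join(f"row{i}\n" for i in range(100)))

# islice stops after five lines, so the rest of the stream is never read
first_five = []
for index, line in enumerate(islice(sample, 5)):
    first_five.append((index, line.rstrip("\n")))

# The next read picks up exactly where islice stopped
next_line = sample.readline()

print(first_five)
print(repr(next_line))
```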



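The following modification can be made to print lines from the end of this large data file.  Recall, the number of records in this data file is 5721010, so the last index is 5721009.  Note that the strict inequality index > (5721010-5) matches indices 5721006 through 5721009, so four lines are printed; using >= instead would print the last five.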

In [17]:
# Reading in data one line at a time - showing the last 5 lines
with open("/content/sample_data/MedicarePayments_Physicians_Stream.csv", "r") as file:

    for index, line in enumerate(file):
       if index > (5721010-5):
         print(index,line)

5721006 1209319,ANDERS,CARLSON,401 PHALEN BLVD,SAINT PAUL,MN,55130,Medical,Abbott Laboratories,100000010680,1,4896,10/01/2020

5721007 75907,JACKIE,FRENCH,223 E 34th St,New York,NY,10016-4852,Medical,Biogen Inc.,100000000193,1,831.25,05/11/2020

5721008 829368,BRAD,VAUGHN,101 Manning Dr Fl 1,Chapel Hill,NC,27514-4220,Medical Doctor,Biogen Inc.,100000000193,1,1000,02/13/2020

5721009 141838,Kenneth,Rosenfield,55 FRUIT ST # 800MAILSTOP 843,Boston,MA,02114,Medical Doctor,AngioDynamics Inc.,100000005504,1,46500,02/19/2020



The following modification can be made to show every 1000000th line.  Modular arithmetic is used to accomplish this; in Python, the remainder from a division is computed with the % operator.

In [18]:
# Reading in data one line at a time - showing every 1000000th line
with open("/content/sample_data/MedicarePayments_Physicians_Stream.csv", "r") as file:

    for index, line in enumerate(file):
       if index % 1000000 == 0:
         print(index,line)

0 557946,Vikas,Pilly,400 International Drive,Williamsville,NY,14221-5771,Medical,Caerus Corp.,100000246826,1,14875,08/19/2020

1000000 331374,MARIA,BACCORO,30 E RIVER PARK PL W,FRESNO,CA,93720-1545,Medical Doctor,Allergan Inc.,100000000278,1,3.75,02/21/2020

2000000 300504,DEEPINDER,BURN,975 HALL ST,WIGGINS,MS,39577,Medical Doctor,GlaxoSmithKline LLC.,100000005449,1,15.02,03/31/2020

3000000 91560,DANIEL,ZUCKERBROD,14400 WEST MCNICHOLS,DETROIT,MI,48235-3916,Medical Doctor,Regeneron Pharmaceuticals Inc.,100000136416,1,34.71,09/23/2020

4000000 158584,LUIS,ZUNIGA MONTES,4511 HORIZON HILL BLVD,SAN ANTONIO,TX,78229,Medical Doctor,Horizon Therapeutics plc,100000131389,1,13.36,09/11/2020

5000000 1055948,SUSHMA,HIRANI,2944 HUNTER MILL RD STE 101,OAKTON,VA,22124,Medical Doctor,Romark Laboratories LC,100000011162,1,1.69,01/09/2020
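
As a small illustration of the % operator, the following sketch uses a step of 3 over ten indices instead of 1000000 over millions of lines; the variable names are illustrative only.

```python
# Remainder of 7 divided by 3
remainder = 7 % 3

# Indices that would trigger a print when the step is 3
step = 3
printed = [index for index in range(10) if index % step == 0]

print(remainder)
print(printed)
```

Index 0 always satisfies the condition, which is why the first record appears in the output of the cell above.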



## Processing each Line

The delimiter in this file is a comma, i.e. this is a comma-separated values file.  Thus, to process each line, the information will be split on the comma.  The following code reads in a line and splits the information into a Python list.  Putting each line into a list will allow us to easily pick off the desired information from each line.

In [19]:
# Reading in data one line at a time
# Putting the information from each line into a list by splitting on the comma
with open("/content/sample_data/MedicarePayments_Physicians_Stream.csv", "r") as file:

    for index, line in enumerate(file):
       if index < 10:
         print(index,line.split(","))

0 ['557946', 'Vikas', 'Pilly', '400 International Drive', 'Williamsville', 'NY', '14221-5771', 'Medical', 'Caerus Corp.', '100000246826', '1', '14875', '08/19/2020\n']
1 ['276936', 'Matthew', 'Hall', '301 Prospect Avenue', 'Syracuse', 'NY', '13203', 'Doctor of Dentistry', 'Solvay Dental 360-A Div of Solvay Specialty Polymers USA LLC', '100000151564', '1', '5296', '06/16/2020\n']
2 ['1275463', 'Stephen', 'Campbell', '228 Ravine Road', 'Hinsdale', 'IL', '60521', 'Doctor of Dentistry', 'Solvay Dental 360-A Div of Solvay Specialty Polymers USA LLC', '100000151564', '1', '1050', '08/02/2020\n']
3 ['268352', 'Leroy', 'McCarty', '8100 W. 78th Street', 'Edina', 'MN', '55439-2516', 'Medical', 'Caerus Corp.', '100000246826', '1', '4750', '03/19/2020\n']
4 ['904225', 'Michael', 'Latza', '212 Park Avenue', 'Plainfield', 'NJ', '07060-1206', 'Chiropractor', 'Caerus Corp.', '100000246826', '1', '320', '10/13/2020\n']
5 ['995636', 'MEREDITH', 'CARBONE-DOYLE', '734 N MAIN ST', 'LACONIA', 'NH', '03247',
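
Splitting on every comma works when no field contains a comma of its own. Whether this data file contains quoted fields is not known from the records shown, so the following is only a sketch of the general issue: it compares split(",") with the standard-library `csv` module on a hypothetical record whose organization name includes a quoted comma.

```python
import csv
import io

# Hypothetical record whose organization name contains a comma inside quotes
line = '12345,Jane,Doe,"Acme, Inc.",100.50\n'

# Naive split breaks the quoted field into two pieces
naive_fields = line.split(",")

# csv.reader respects the quotes and also drops the trailing newline
csv_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), naive_fields)
print(len(csv_fields), csv_fields)
```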

Next, suppose the goal is to obtain the **Payment** amount from each line.  Python lists are indexed starting at 0, so the Payment field, which is the 12th field in each record, is at index 11 in this list.  The following code retrieves the Payment from each line and prints it for the first five lines.

In [21]:
# Reading in data one line at a time
# Putting the information from each line into a list by splitting on the comma
# Retrieving the Payment, i.e. the item at index 11 in this list
with open("/content/sample_data/MedicarePayments_Physicians_Stream.csv", "r") as file:

    for index, line in enumerate(file):
       if index < 5:
         print(index,line.split(",")[11])

0 14875
1 5296
2 1050
3 4750
4 320


<table width='100%'><tr><td bgcolor='orange' align='center'><font size="+2">Python Function</font></td></tr></table>

In [22]:
def getItem(input_line, which_field=1, return_string=True):
  '''  Purpose: Gets a particular field from an input line 

       Args:
         input_line: line to process
         which_field: zero-based index of the field to retrieve
         return_string: a logical to identify whether or not the item should be returned as a string
        
      Returns: the desired item from the input_line list 
  '''
  
  # Split the line on commas and pick off the requested field
  item = str(input_line.split(",")[which_field])
  # Convert to a number when a string is not wanted
  if not return_string:
    item = float(item)

  return item

<table width='100%'><tr><td bgcolor='orange' align='center'><font size="+2">&nbsp;</font></td></tr></table>

Next, using the getItem() custom function to retrieve the desired items from each line.

In [24]:
# Reading in data one line at a time
# Using the getItem() custom function to retrieve desired information
with open("/content/sample_data/MedicarePayments_Physicians_Stream.csv", "r") as file:

    for index, line in enumerate(file):
       if index < 5:
         print(index,
               getItem(line, which_field=11, return_string=False)
               )

0 14875.0
1 5296.0
2 1050.0
3 4750.0
4 320.0
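
One detail to keep in mind is that the last field of each line still carries the newline character, as seen in the split output earlier. The following is a minimal sketch of a variant that strips surrounding whitespace; `get_item_stripped` is a hypothetical name and not part of this notebook's code.

```python
def get_item_stripped(input_line, which_field=1, return_string=True):
    """Hypothetical variant of getItem() that strips surrounding whitespace."""
    item = input_line.split(",")[which_field].strip()
    return item if return_string else float(item)

# First record shown in the head output of the data file
line = "557946,Vikas,Pilly,400 International Drive,Williamsville,NY,14221-5771,Medical,Caerus Corp.,100000246826,1,14875,08/19/2020\n"

payment = get_item_stripped(line, which_field=11, return_string=False)
date = get_item_stripped(line, which_field=12)

print(payment, repr(date))
```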


Before continuing, let's test the functionality of the return_string argument of the getItem() custom function.

In [28]:
# Reading in data one line at a time
# Using the getItem() custom function to retrieve desired information
# Testing the return_string argument of this function
with open("/content/sample_data/MedicarePayments_Physicians_Stream.csv", "r") as file:

    for index, line in enumerate(file):
       if index < 5:
         print(index,
               type(getItem(line, which_field=11, return_string=True))
               )

0 <class 'str'>
1 <class 'str'>
2 <class 'str'>
3 <class 'str'>
4 <class 'str'>


## Data Summaries via Streaming - SUM()

Consider the following code that will compute the sum of all payments across all records.  The streaming approach used here does **not** require that the entire data set be loaded into Python.

Note:  The print() statement should be used judiciously to help you understand how the data processing is being done.  However, be careful to **not** print each line when processing all lines!

In [32]:
#Initialize SumPayments and Counter
SumPayments = 0.0
Counter = 0

#Read data in one line at a time
#Use the getItem() custom function to return the Payment field
#Judicious use of the print() statement will help identify errors in your processing

with open("/content/sample_data/MedicarePayments_Physicians_Stream.csv", "r") as file:

    for index, line in enumerate(file):
      if index < 5:
        Counter += 1
        SumPayments += getItem(line, which_field=11, return_string=False)
      if index < 5:
         print(index, Counter, getItem(line, which_field=11, return_string=False), round(SumPayments,2))

0 1 14875.0 14875.0
1 2 5296.0 20171.0
2 3 1050.0 21221.0
3 4 4750.0 25971.0
4 5 320.0 26291.0


Consider the following code, which includes additional print statements that can be used when considering all 5.7 million records.

In [47]:
#Initialize SumPayments and Counter
SumPayments = 0.0
Counter = 0

#Read data in one line at a time
#Use the getItem() custom function to return the Payment field
#Judicious use of the print() statement will help identify processing steps

with open("/content/sample_data/MedicarePayments_Physicians_Stream.csv", "r") as file:

    for index, line in enumerate(file):
      if index < 5:
        Counter += 1
        SumPayments += getItem(line, which_field=11, return_string=False)
        if index < 5:
           print(Counter, round(SumPayments,2))
        #if index % 1000000 == 0:
        # print(Counter, round(SumPayments,2))

#Print the outcomes at the very end
print("Final Outcomes: " + str(Counter) + " | " + str(round(SumPayments,2)))

1 14875.0
2 20171.0
3 21221.0
4 25971.0
5 26291.0
Final Outcomes: 5 | 26291.0
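
The running count and sum can also be wrapped in a function that accepts any iterable of lines, such as an open file. The following is a minimal sketch; `stream_sum` is a hypothetical name, and the sample uses a reduced two-field layout (an assumption for illustration) holding the first five payments from the output above.

```python
import io

def stream_sum(lines, which_field=11):
    """Accumulate a count and a sum over an iterable of comma-separated lines."""
    counter = 0
    total = 0.0
    for line in lines:
        counter += 1
        # float() ignores the trailing newline on the last field
        total += float(line.split(",")[which_field])
    return counter, total

# First five payments in a reduced two-field layout (assumption for illustration)
sample = io.StringIO("a,14875\nb,5296\nc,1050\nd,4750\ne,320\n")
counter, total = stream_sum(sample, which_field=1)

# A mean follows directly from the two running quantities
mean_payment = total / counter
print(counter, round(total, 2), round(mean_payment, 2))
```

Only two numbers are kept in memory at any time, which is what makes this approach workable for very large files.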


## Data Summaries via Streaming - Top N List

Next, consider a situation in which a Top N list is desired. For example, a Top 3 List or Top 5 List regarding payments.

The data processing steps required here are more complicated than computing a simple sum from each line.  An example (in a spreadsheet) is provided here.

Example:  <a href="https://docs.google.com/spreadsheets/d/1DiX0pm18fXm9iQ-zaFtAwlo5riDsly1e58ZQFdo0tGs/edit?usp=sharing">Top 3 Spreadsheet</a>

The pandas library will be used to accomplish this task, so let's load this package first.

In [49]:
import pandas as pd

The first step is to initialize a pandas DataFrame that will be used for the desired outcomes.  The DataFrame is initialized with NaN values, as shown here.

In [56]:
Top5_df = pd.DataFrame(columns=['Value', 'Name'],
                           index=range(0, 5)
                           )

print(Top5_df)

  Value Name
0   NaN  NaN
1   NaN  NaN
2   NaN  NaN
3   NaN  NaN
4   NaN  NaN


<table width='100%'><tr><td bgcolor='orange' align='center'><font size="+2">Python Function</font></td></tr></table>

In [58]:
def TopNList(TopN_df, input_value, input_name):
  '''  Purpose: To process a Top N list of values for a streaming data situation

       Args:
         TopN_df: the existing DataFrame that contains the top N quantities
         input_value: incoming data value
         input_name: name associated with incoming data value
        
      Returns: 
         DataFrame that contains the top N values and the associated names
  '''
  
  # Row 0 holds either an empty (NaN) slot or the smallest value currently in the list
  if TopN_df['Value'].isnull().values.any() or input_value > TopN_df.loc[0, "Value"]:
    # Overwrite row 0 with the incoming value, then re-sort so NaN / smallest is back in row 0
    TopN_df.loc[0, "Value"] = input_value
    TopN_df.loc[0, "Name"] = input_name
    TopN_df.sort_values(by='Value', ascending=True, na_position='first', ignore_index = True, inplace = True)
 
  return TopN_df

<table width='100%'><tr><td bgcolor='orange' align='center'><font size="+2">&nbsp;</font></td></tr></table>

In [66]:
#Initialize a data.frame for the outcomes
Top5_df = pd.DataFrame(columns=['Value', 'Name'],
                           index=range(0, 5)
                           )

#Stream through the data to obtain the top N list
with open("/content/sample_data/MedicarePayments_Physicians_Stream.csv", "r") as file:

    for index, line in enumerate(file):
      if index < 5:
        Top5_df = TopNList(Top5_df, 
                           input_value = getItem(line, which_field=11, return_string=False), 
                           input_name = getItem(line, which_field=0, return_string=True)
                           )
        if index < 5:
            print(Top5_df)

        #if index % 1000000 == 0:
        #   print(Top5_df)

print(Top5_df)

     Value    Name
0      NaN     NaN
1      NaN     NaN
2      NaN     NaN
3      NaN     NaN
4  14875.0  557946
     Value    Name
0      NaN     NaN
1      NaN     NaN
2      NaN     NaN
3   5296.0  276936
4  14875.0  557946
     Value     Name
0      NaN      NaN
1      NaN      NaN
2   1050.0  1275463
3   5296.0   276936
4  14875.0   557946
     Value     Name
0      NaN      NaN
1   1050.0  1275463
2   4750.0   268352
3   5296.0   276936
4  14875.0   557946
     Value     Name
0    320.0   904225
1   1050.0  1275463
2   4750.0   268352
3   5296.0   276936
4  14875.0   557946
     Value     Name
0    320.0   904225
1   1050.0  1275463
2   4750.0   268352
3   5296.0   276936
4  14875.0   557946
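
For comparison, the same idea can be sketched without pandas by using a min-heap from the standard-library `heapq` module. This is an alternative technique, not the approach used in this notebook; `top_n_stream` is a hypothetical name, and the last two pairs in the sample are made-up values added to show a replacement and a rejection.

```python
import heapq

def top_n_stream(pairs, n=5):
    """Keep the n largest (value, name) pairs seen so far using a min-heap."""
    heap = []
    for value, name in pairs:
        if len(heap) < n:
            # Still filling the list
            heapq.heappush(heap, (value, name))
        elif value > heap[0][0]:
            # Incoming value beats the current smallest, so swap it in
            heapq.heapreplace(heap, (value, name))
    return sorted(heap)

# Payments and physician IDs from the first five records, plus two hypothetical extras
pairs = [(14875.0, "557946"), (5296.0, "276936"), (1050.0, "1275463"),
         (4750.0, "268352"), (320.0, "904225"), (9999.0, "X1"), (10.0, "X2")]

top3 = top_n_stream(pairs, n=3)
print(top3)
```

The heap keeps its smallest element at position 0, which plays the same role as row 0 of the sorted DataFrame in TopNList().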




---



---
End of Document
