In [1]:
import pandas as pd
import numpy as np

### Part 1: Folders and files

Rather than pointing and clicking through folders and files, we are going to let Python do the work for us.  We will focus on the `os` module, but other modules are worth a look as well (e.g., `glob`).

In [2]:
import os

First, let's understand how Python maps folders on your computer.


In [3]:
# Current working directory
cwd = os.getcwd()
cwd

'/Users/ws/Projects/study_books/07_Programing_and_Data/Week_03'

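An alternative is to open the folder where this notebook is stored in your file browser (Explorer on Windows, Finder on macOS) and copy its path.  Then create a variable, `url1`, and prefix the string with `r` so that Python stores the full path as a raw string, which keeps any backslashes from being read as escape characters.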

In [4]:
url1 = r'/Users/ws/Projects/study_books/07_Programing_and_Data/Week_03'
url1

'/Users/ws/Projects/study_books/07_Programing_and_Data/Week_03'
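
The `r` prefix matters most on Windows, where copied paths contain backslashes. Here is a minimal sketch of the difference; the Windows-style path `C:\temp\new_folder` is made up for illustration and is not part of this notebook.

```python
# Hypothetical Windows-style path, used only to illustrate raw strings.
plain = 'C:\temp\new_folder'   # \t becomes a tab and \n becomes a newline
raw = r'C:\temp\new_folder'    # backslashes are kept exactly as typed

# repr() shows the characters Python actually stored.
print(repr(plain))
print(repr(raw))
```

On macOS or Linux, paths use forward slashes, so the raw string and the plain string are usually identical, as in the cell above.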

Let's take a look at what is contained in the current working directory.  For that we'll use: `os.listdir()`

In [5]:
os.listdir(cwd)

['.DS_Store',
 'MFIN290_Folders_and_Functions.ipynb',
 'MFIN290_Grouping_and_Summary_Stats.ipynb',
 'MFIN290_Missing_Data.ipynb',
 'MFIN290_Stacking_and_Merging.ipynb',
 'fina.csv',
 'Earnings.zip',
 'MFIN290_Reading_and_Filtering_Data.ipynb',
 'MFIN290_HW3.ipynb']
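
Notice that `os.listdir()` returns everything in the folder, including hidden files such as `.DS_Store`. As a rough sketch of the `glob` alternative mentioned earlier, the example below builds a throwaway folder (the file name `notes.txt` is invented for the example) and keeps only the names that match a pattern.

```python
import glob
import os
import tempfile

# Build a throwaway folder so the example does not depend on your own files.
with tempfile.TemporaryDirectory() as folder:
    for name in ['.DS_Store', 'fina.csv', 'notes.txt']:
        open(os.path.join(folder, name), 'w').close()

    # os.listdir() returns every entry, hidden files included.
    everything = sorted(os.listdir(folder))

    # glob.glob() returns only the paths that match the pattern.
    csv_paths = glob.glob(os.path.join(folder, '*.csv'))
    csv_names = [os.path.basename(p) for p in csv_paths]

print(everything)
print(csv_names)
```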

We need to extract the files in 'Earnings.zip'.  For that, we'll use the `zipfile` module.

In [6]:
from zipfile import ZipFile

In [7]:
with ZipFile('Earnings.zip', 'r') as zipObj:
    zipObj.extractall()
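
It can help to look inside an archive before extracting it. The sketch below assumes a small archive created on the fly (`demo.zip` and `Demo/sample.csv` are hypothetical names) and uses `namelist()` to inspect it, then `extractall(path=...)` to choose where the files land.

```python
import os
import tempfile
from zipfile import ZipFile

with tempfile.TemporaryDirectory() as folder:
    zip_path = os.path.join(folder, 'demo.zip')  # hypothetical archive name

    # Create a tiny archive containing one file inside a subfolder.
    with ZipFile(zip_path, 'w') as zip_out:
        zip_out.writestr('Demo/sample.csv', 'Symbol,EPS\nXYZ,1.0\n')

    with ZipFile(zip_path, 'r') as zip_in:
        contents = zip_in.namelist()      # inspect before extracting
        zip_in.extractall(path=folder)    # extract into a chosen folder

    extracted = os.listdir(os.path.join(folder, 'Demo'))

print(contents)
print(extracted)
```

Calling `extractall()` with no arguments, as in the cell above, extracts into the current working directory.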

The files from 'Earnings.zip' have been extracted to a new folder called "Earnings."  Let's see what's in that folder.  There are several ways to do this, but we'll use `os.listdir()`, which lists all files in a given directory.



In [8]:
os.listdir('Earnings')

['goog_earnings.csv',
 'msft_earnings.csv',
 'aapl_earnings.csv',
 'nflx_earnings.csv',
 'fb_earnings.csv']

### Part 2: Learning functions

In this part, we'll illustrate the power of user-created functions with a single CSV file.  Let's pull in the earnings file for Apple.

In [9]:
aapl = pd.read_csv('Earnings/aapl_earnings.csv')

In [10]:
aapl.head(10)

Unnamed: 0,Symbol,Company,Earnings Date,EPS Estimate,Reported EPS,Surprise(%)
0,AAPL,Apple Inc.,"Jul 30, 2020, 12 AMEDT",2.04,2.58,26.22
1,AAPL,Apple Inc.,"Apr 30, 2020, 12 AMEDT",2.26,2.55,12.78
2,AAPL,Apple Inc.,"Jan 28, 2020, 12 AMEDT",4.55,4.99,9.74
3,AAPL,Apple Inc.,"Oct 30, 2019, 12 AMEDT",2.84,3.03,6.73
4,AAPL,Apple Inc.,"Jul 30, 2019, 12 AMEDT",2.1,2.18,3.91
5,AAPL,Apple Inc.,"Apr 30, 2019, 12 AMEDT",2.36,2.46,4.19
6,AAPL,Apple Inc.,"Jan 29, 2019, 12 AMEDT",4.17,4.18,0.31
7,AAPL,Apple Inc.,"Nov 01, 2018, 12 AMEDT",2.78,2.91,4.56
8,AAPL,Apple Inc.,"Jul 31, 2018, 12 AMEDT",2.18,2.34,7.29
9,AAPL,Apple Inc.,"May 01, 2018, 12 AMEDT",2.67,2.73,2.13


### User-Created Functions

These functions can be *anything* you want them to be.  Therein lies their power.  We will look at several user-created functions, just to illustrate the basic principles and syntax.

**User Function #1: Using `apply()`**

`apply()` is a pandas method that applies a function along an axis of a DataFrame or to each value of a Series.  Its main argument is the function to apply.

Let's use `apply` with the built-in `len` function to figure out the length of the company name.



In [11]:
aapl['name_length'] = aapl['Company'].apply(len)

In [12]:
aapl.head()

Unnamed: 0,Symbol,Company,Earnings Date,EPS Estimate,Reported EPS,Surprise(%),name_length
0,AAPL,Apple Inc.,"Jul 30, 2020, 12 AMEDT",2.04,2.58,26.22,10
1,AAPL,Apple Inc.,"Apr 30, 2020, 12 AMEDT",2.26,2.55,12.78,10
2,AAPL,Apple Inc.,"Jan 28, 2020, 12 AMEDT",4.55,4.99,9.74,10
3,AAPL,Apple Inc.,"Oct 30, 2019, 12 AMEDT",2.84,3.03,6.73,10
4,AAPL,Apple Inc.,"Jul 30, 2019, 12 AMEDT",2.1,2.18,3.91,10
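
For common string operations, pandas also offers the `.str` accessor, which can stand in for `apply()`. The following sketch uses a small illustrative Series (the values are typed in by hand, not read from the earnings files) to show that both routes give the same lengths.

```python
import pandas as pd

names = pd.Series(['Apple Inc.', 'Netflix, Inc.'])  # illustrative values

with_apply = names.apply(len)   # calls len() on each value
with_str = names.str.len()      # pandas' built-in string accessor

print(with_apply.tolist())
print(with_str.tolist())
```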


**User Function #2: Meet/Beat Earnings Surprises**

Let's create our own function -- we'll call it `meet_or_beat`.  The goal of this function is to identify earnings surprises that:
  - just beat the analyst forecast by a single penny,
  - exactly met the forecast, or 
  - just missed the analyst forecast by a single penny. 


In [13]:
# first, create the `surprise` variable
aapl['surprise'] = aapl['Reported EPS'] - aapl['EPS Estimate'] 

In [14]:
aapl.head()

Unnamed: 0,Symbol,Company,Earnings Date,EPS Estimate,Reported EPS,Surprise(%),name_length,surprise
0,AAPL,Apple Inc.,"Jul 30, 2020, 12 AMEDT",2.04,2.58,26.22,10,0.54
1,AAPL,Apple Inc.,"Apr 30, 2020, 12 AMEDT",2.26,2.55,12.78,10,0.29
2,AAPL,Apple Inc.,"Jan 28, 2020, 12 AMEDT",4.55,4.99,9.74,10,0.44
3,AAPL,Apple Inc.,"Oct 30, 2019, 12 AMEDT",2.84,3.03,6.73,10,0.19
4,AAPL,Apple Inc.,"Jul 30, 2019, 12 AMEDT",2.1,2.18,3.91,10,0.08


In [15]:
# Let's practice creating a user-created function
def temp_function(num):
    if num > 0:
        value = 'positive'
    elif num == 0:
        value = 'zero'
    else:
        value = 'negative'
    return value

In [16]:
temp_variable = temp_function(3)
temp_variable

'positive'

In [17]:
# next, let's create our own function:
def meet_or_beat(row):
    if 0 < row['surprise'] <= 0.01:
        value = 'justbeat'
    elif row['surprise'] == 0:
        value = 'exactlymeet'
    elif -0.01 <= row['surprise'] < 0:
        value = 'justmiss'
    else:
        value = 'none'
    return value
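
One caveat: subtracting two floats does not always give an exact cent, so a one-penny surprise can land slightly outside the ±0.01 cutoffs. The FB row dated 2013-05-01 later in this notebook, which shows a surprise of -0.01 but is labeled `none`, is consistent with this effect. A possible workaround, sketched here as a hypothetical variant named `meet_or_beat_rounded` (not part of the original notebook), is to round to whole cents before comparing.

```python
def meet_or_beat_rounded(row):
    # Hypothetical variant: round to whole cents before comparing.
    surprise = round(row['surprise'], 2)
    if 0 < surprise <= 0.01:
        return 'justbeat'
    elif surprise == 0:
        return 'exactlymeet'
    elif -0.01 <= surprise < 0:
        return 'justmiss'
    return 'none'

raw_surprise = 0.12 - 0.13       # a one-cent miss on paper
print(repr(raw_surprise))        # shows the stored float, not exactly -0.01
print(raw_surprise >= -0.01)     # comparison on the unrounded float
print(meet_or_beat_rounded({'surprise': raw_surprise}))
```

A plain dictionary is enough for the demonstration because the function only looks up `row['surprise']`; with a DataFrame it would be used through `apply(..., axis=1)` in the same way as `meet_or_beat`.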



In [18]:
# By setting axis = 1, we're applying meet_or_beat row-by-row
aapl['meet_beat'] = aapl.apply(meet_or_beat, axis=1)
aapl.head(15)

Unnamed: 0,Symbol,Company,Earnings Date,EPS Estimate,Reported EPS,Surprise(%),name_length,surprise,meet_beat
0,AAPL,Apple Inc.,"Jul 30, 2020, 12 AMEDT",2.04,2.58,26.22,10,0.54,none
1,AAPL,Apple Inc.,"Apr 30, 2020, 12 AMEDT",2.26,2.55,12.78,10,0.29,none
2,AAPL,Apple Inc.,"Jan 28, 2020, 12 AMEDT",4.55,4.99,9.74,10,0.44,none
3,AAPL,Apple Inc.,"Oct 30, 2019, 12 AMEDT",2.84,3.03,6.73,10,0.19,none
4,AAPL,Apple Inc.,"Jul 30, 2019, 12 AMEDT",2.1,2.18,3.91,10,0.08,none
5,AAPL,Apple Inc.,"Apr 30, 2019, 12 AMEDT",2.36,2.46,4.19,10,0.1,none
6,AAPL,Apple Inc.,"Jan 29, 2019, 12 AMEDT",4.17,4.18,0.31,10,0.01,justbeat
7,AAPL,Apple Inc.,"Nov 01, 2018, 12 AMEDT",2.78,2.91,4.56,10,0.13,none
8,AAPL,Apple Inc.,"Jul 31, 2018, 12 AMEDT",2.18,2.34,7.29,10,0.16,none
9,AAPL,Apple Inc.,"May 01, 2018, 12 AMEDT",2.67,2.73,2.13,10,0.06,none
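
To see what `axis=1` changes, here is a small sketch on a made-up two-column DataFrame (the column names `a` and `b` are placeholders). With `axis=1` the function receives one row at a time, so columns can be looked up by name; with the default `axis=0` it receives one column at a time.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})  # made-up data

# axis=1: the function receives each row as a Series indexed by column name.
row_sums = df.apply(lambda row: row['a'] + row['b'], axis=1)

# axis=0 (the default): the function receives each column as a Series.
col_sums = df.apply(lambda col: col.sum())

print(row_sums.tolist())
print(col_sums.to_dict())
```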


In [19]:
aapl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Symbol         89 non-null     object 
 1   Company        89 non-null     object 
 2   Earnings Date  89 non-null     object 
 3   EPS Estimate   89 non-null     float64
 4   Reported EPS   89 non-null     float64
 5   Surprise(%)    89 non-null     float64
 6   name_length    89 non-null     int64  
 7   surprise       89 non-null     float64
 8   meet_beat      89 non-null     object 
dtypes: float64(4), int64(1), object(4)
memory usage: 6.4+ KB


**User Function #3: Clean up dates**

Notice that the date variable, `Earnings Date`, is a string object that has extraneous information.  Let's create our own function to clean up the variable and create a new variable that is a datetime object.  We will use `split()` and `datetime.strptime` to do so.

In [20]:
# an example to show the point
date1 = 'Jul 30, 2020, 12 AMEDT'
date1

'Jul 30, 2020, 12 AMEDT'

In [21]:
# split() returns a list, splitting the string on the given separator
date1.split(',')

['Jul 30', ' 2020', ' 12 AMEDT']

In [22]:
# let's extract an element from that list using index notation
date1.split(',')[0]

'Jul 30'

In [23]:
date1.split(',')[1]

' 2020'

In [24]:
date2 = date1.split(',')[0] + date1.split(',')[1]
date2

'Jul 30 2020'

In [25]:
import datetime as dt
fixed_date = dt.datetime.strptime(date2, '%b %d %Y')

In [26]:
fixed_date

datetime.datetime(2020, 7, 30, 0, 0)

In [27]:
# Let's create a function that combines all the previous steps
def convert_date(date):
    parts = date.split(',')  # e.g. ['Jul 30', ' 2020', ' 12 AMEDT']
    fixed_date = parts[0] + parts[1]  # e.g. 'Jul 30 2020'
    fixed_date = dt.datetime.strptime(fixed_date, '%b %d %Y')
    return fixed_date

In [28]:
convert_date(date1)

datetime.datetime(2020, 7, 30, 0, 0)
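
The same steps can also be carried out on a whole column at once with pandas string methods and `pd.to_datetime`. This is only a sketch of one alternative, using two date strings copied from the output above. It assumes every value follows the same `'Mon DD, YYYY, ...'` layout.

```python
import pandas as pd

dates = pd.Series(['Jul 30, 2020, 12 AMEDT', 'Apr 30, 2020, 12 AMEDT'])

parts = dates.str.split(',')              # each value becomes a list of pieces
date_text = parts.str[0] + parts.str[1]   # e.g. 'Jul 30 2020'
edate = pd.to_datetime(date_text, format='%b %d %Y')

print(edate.tolist())
```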

Now let's apply our own function, `convert_date`, to the `aapl` DataFrame.

In [29]:
aapl['edate'] = aapl['Earnings Date'].apply(convert_date)

In [30]:
aapl.head()

Unnamed: 0,Symbol,Company,Earnings Date,EPS Estimate,Reported EPS,Surprise(%),name_length,surprise,meet_beat,edate
0,AAPL,Apple Inc.,"Jul 30, 2020, 12 AMEDT",2.04,2.58,26.22,10,0.54,none,2020-07-30
1,AAPL,Apple Inc.,"Apr 30, 2020, 12 AMEDT",2.26,2.55,12.78,10,0.29,none,2020-04-30
2,AAPL,Apple Inc.,"Jan 28, 2020, 12 AMEDT",4.55,4.99,9.74,10,0.44,none,2020-01-28
3,AAPL,Apple Inc.,"Oct 30, 2019, 12 AMEDT",2.84,3.03,6.73,10,0.19,none,2019-10-30
4,AAPL,Apple Inc.,"Jul 30, 2019, 12 AMEDT",2.1,2.18,3.91,10,0.08,none,2019-07-30


In [31]:
aapl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Symbol         89 non-null     object        
 1   Company        89 non-null     object        
 2   Earnings Date  89 non-null     object        
 3   EPS Estimate   89 non-null     float64       
 4   Reported EPS   89 non-null     float64       
 5   Surprise(%)    89 non-null     float64       
 6   name_length    89 non-null     int64         
 7   surprise       89 non-null     float64       
 8   meet_beat      89 non-null     object        
 9   edate          89 non-null     datetime64[ns]
dtypes: datetime64[ns](1), float64(4), int64(1), object(4)
memory usage: 7.1+ KB


### Part 3: Looping over files and applying functions

We have several files that we want to pull together.  Let's loop over the files and apply the functions we've created as we do so.

In [32]:
csvlist = os.listdir('Earnings')
csvlist

['goog_earnings.csv',
 'msft_earnings.csv',
 'aapl_earnings.csv',
 'nflx_earnings.csv',
 'fb_earnings.csv']

Let's create a user-created function, `get_csv`, that aggregates all our tasks into one function.  All at once, this function will:
 - bring in the related `csv` file
 - remove rows with missing values
 - create the `surprise` variable
 - apply the previously-defined `meet_or_beat` function to create the `meet_beat` variable
 - apply the previously-defined `convert_date` function
 - subset the data to retain only the variables we want


In [33]:
# Let's create a function that does all the work 
def get_csv(file):
    earnings = pd.read_csv('Earnings/'+file, index_col=None, header=0) # Read csv
    earnings = earnings.dropna() # Drop missing values
    earnings['surprise'] = earnings['Reported EPS'] - earnings['EPS Estimate'] # Create a new variable surprise
    earnings['meet_beat'] = earnings.apply(meet_or_beat, axis=1) # Apply meet_or_beat() function to create a new variable
    earnings['edate'] = earnings['Earnings Date'].apply(convert_date) # Apply convert_date() function to create a new variable
    earnings = earnings[['Symbol','edate','EPS Estimate','Reported EPS', 'surprise','meet_beat']] # Keep a subset of variables
    return earnings
    
    

We need to run the `get_csv` function over each csv file in the directory.  Let's do so with a *for-loop*.

In [34]:
#create an empty list
dfs = []

# use a for-loop to iterate over each element, i, in the list of csv files, csvlist.
for i in csvlist:
    csv = get_csv(i)
    dfs.append(csv)
    

In [35]:
# dfs is a list that holds the DataFrame built from each csv file separately
dfs[4]

Unnamed: 0,Symbol,edate,EPS Estimate,Reported EPS,surprise,meet_beat
0,FB,2020-07-30,1.39,1.8,0.41,none
1,FB,2020-04-29,1.76,1.71,-0.05,none
2,FB,2020-01-29,2.53,2.56,0.03,none
3,FB,2019-10-30,1.91,2.12,0.21,none
4,FB,2019-07-24,1.88,0.91,-0.97,none
5,FB,2019-04-24,1.63,0.85,-0.78,none
6,FB,2019-01-30,2.19,2.38,0.19,none
7,FB,2018-10-30,1.47,1.76,0.29,none
8,FB,2018-07-25,1.72,1.74,0.02,none
9,FB,2018-04-25,1.35,1.69,0.34,none


In [36]:
# let's concatenate all the data in `dfs` into a single DataFrame
earnings1 = pd.concat(dfs, axis=0, ignore_index=True)
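
The `ignore_index=True` argument is what gives the combined DataFrame a fresh 0, 1, 2, ... index. A small sketch with two stand-in DataFrames (the tickers `AAA` and `BBB` and their values are made up) shows the difference.

```python
import pandas as pd

# Two tiny stand-in DataFrames (made-up tickers and values).
first = pd.DataFrame({'Symbol': ['AAA', 'AAA'], 'surprise': [0.01, 0.00]})
second = pd.DataFrame({'Symbol': ['BBB', 'BBB'], 'surprise': [-0.01, 0.05]})

# Without ignore_index, each piece keeps its own row labels.
kept = pd.concat([first, second], axis=0)

# With ignore_index=True, the rows are renumbered from zero.
renumbered = pd.concat([first, second], axis=0, ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```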

In [37]:
earnings1.tail()

Unnamed: 0,Symbol,edate,EPS Estimate,Reported EPS,surprise,meet_beat
300,FB,2013-07-24,0.14,0.19,0.05,none
301,FB,2013-05-01,0.13,0.12,-0.01,none
302,FB,2013-01-30,0.15,0.17,0.02,none
303,FB,2012-10-23,0.11,0.12,0.01,justbeat
304,FB,2012-07-26,0.12,0.12,0.0,exactlymeet


In [38]:
earnings1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 305 entries, 0 to 304
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Symbol        305 non-null    object        
 1   edate         305 non-null    datetime64[ns]
 2   EPS Estimate  305 non-null    float64       
 3   Reported EPS  305 non-null    float64       
 4   surprise      305 non-null    float64       
 5   meet_beat     305 non-null    object        
dtypes: datetime64[ns](1), float64(3), object(2)
memory usage: 14.4+ KB


In [39]:
earnings1.to_csv('earnings1.csv', index=False)

### Part 4: Using Python to create/adjust/move folders

#### 4.1 Create a new folder

In this step, we'll create a new folder on our computer and move the individual earnings CSV files into that folder.

In [40]:
to_directory = os.getcwd() +'/Finished_Earnings/'
to_directory

'/Users/ws/Projects/study_books/07_Programing_and_Data/Week_03/Finished_Earnings/'

In [41]:
# If the folder doesn't exist, create a new one
if not os.path.exists(to_directory):
    os.mkdir(to_directory)
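
Building paths with `+ '/'` works on macOS and Linux, but `os.path.join` picks the right separator for whichever operating system runs the code. The sketch below works inside a temporary folder so nothing on your machine is touched. It also shows `os.makedirs(..., exist_ok=True)`, which folds the existence check into a single call.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    # os.path.join inserts the correct separator for the operating system.
    target = os.path.join(base, 'Finished_Earnings')

    os.makedirs(target, exist_ok=True)   # creates the folder
    os.makedirs(target, exist_ok=True)   # second call does nothing, no error

    created = os.path.isdir(target)

print(created)
```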

In [42]:
from_directory = os.getcwd() + '/Earnings/'
from_directory

'/Users/ws/Projects/study_books/07_Programing_and_Data/Week_03/Earnings/'

In [43]:
os.listdir(from_directory)

['goog_earnings.csv',
 'msft_earnings.csv',
 'aapl_earnings.csv',
 'nflx_earnings.csv',
 'fb_earnings.csv']

In [44]:
import shutil

In [45]:
files = os.listdir(from_directory)
files

['goog_earnings.csv',
 'msft_earnings.csv',
 'aapl_earnings.csv',
 'nflx_earnings.csv',
 'fb_earnings.csv']

In [46]:
move_df = pd.DataFrame({'filename':files})
move_df


Unnamed: 0,filename
0,goog_earnings.csv
1,msft_earnings.csv
2,aapl_earnings.csv
3,nflx_earnings.csv
4,fb_earnings.csv


In [47]:
# loop over a DataFrame using iterrows()
for index, row in move_df.iterrows():
    filename = row['filename']  # Take each file name in move_df
    source = from_directory + filename  # Full path of the file in the source folder
    dest = to_directory + filename  # Full path of the file in the destination folder
    shutil.move(source, dest)  # Move each file
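
The DataFrame is a handy way to practice `iterrows()`, but the same move can be done by looping over the list of file names directly. Here is a self-contained sketch that uses a temporary folder and two empty stand-in files, so it does not touch the real `Earnings` folder.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as base:
    src_dir = os.path.join(base, 'Earnings')
    dst_dir = os.path.join(base, 'Finished_Earnings')
    os.makedirs(src_dir)
    os.makedirs(dst_dir)

    # Create two empty stand-in files to move.
    for name in ['aapl_earnings.csv', 'fb_earnings.csv']:
        open(os.path.join(src_dir, name), 'w').close()

    # Move every file from the source folder to the destination folder.
    for name in os.listdir(src_dir):
        shutil.move(os.path.join(src_dir, name), os.path.join(dst_dir, name))

    left_behind = os.listdir(src_dir)
    moved = sorted(os.listdir(dst_dir))

print(left_behind)
print(moved)
```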

In [48]:
os.listdir(from_directory)

[]

In [49]:
os.listdir(to_directory)

['goog_earnings.csv',
 'msft_earnings.csv',
 'aapl_earnings.csv',
 'nflx_earnings.csv',
 'fb_earnings.csv']