## 1.0  Reading CSV File

We will use pandas, a Python package that represents data in structures that are easy to manipulate.  One of the most useful is the DataFrame, where data are arranged in rows (observations) and columns (variables), much like an Excel spreadsheet.

Below are the commands for reading the CSV file.

Unfortunately, the error message indicates that the file cannot be read because it is not encoded in the default 'utf-8' format.

In [8]:
import pandas as pd
stock = pd.read_csv("STI_Index.csv")
stock

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 4: invalid start byte

We use the detect function from the chardet module to determine how the file is encoded.  It turns out to be 'ISO-8859-1', although chardet reports a confidence of only 0.73.

In [9]:
import chardet
with open("STI_Index.csv", "rb") as fh:  # read raw bytes; the file is closed on exit
    raw = fh.read()
chardet.detect(raw)

{'confidence': 0.73, 'encoding': 'ISO-8859-1', 'language': ''}

Now all we have to do is pass encoding as one of the arguments to the pd.read_csv() function, as illustrated below.  Note that we end up with a neatly arranged table.  The Python terminology for this table is a DataFrame, or more specifically a pandas DataFrame.

In [14]:
import pandas as pd
stock = pd.read_csv("STI_Index.csv", encoding = "ISO-8859-1")
stock

Unnamed: 0,Counter_Name,Code,Last,Chg,PercentChange,Vol,B Vol,Buy,Sell,S Vol,High,Low,Value,Sector
0,OCBC Bank,O39,11.68,-,-,2824.5,21.2,11.67,11.68,15.7,11.76,11.61,32993891.0,FIN
1,Ascendas Reit,A17U,2.63,0.01,0.382,2382.5,425.5,2.62,2.63,198.6,2.64,2.61,6245997.5,PROP
2,CapitaLand,C31,3.18,-0.02,-0.625,4501.2,569.0,3.18,3.19,722.7,3.21,3.17,14339453.0,PROP
3,CapitaMall Trust,C38U,2.02,-0.02,-0.98,5270.2,3126.3,2.02,2.03,1173.5,2.04,2.01,10677863.6,PROP
4,CityDev,C09,11.02,-0.07,-0.631,958.7,10.0,11.02,11.03,9.2,11.06,10.95,10562490.0,PROP
5,ComfortDelGro,C52,2.3,0.04,1.77,4902.9,286.3,2.3,2.31,654.3,2.32,2.25,11225389.0,TSC
6,DBS,D05,26.83,-0.11,-0.408,2993.6,4.9,26.83,26.84,3.8,27.2,26.8,80860990.0,FIN
7,Genting Sing,G13,1.22,-,-,19180.2,3408.1,1.21,1.22,3008.4,1.24,1.21,23447047.0,Hotels
8,Golden Agri-Res,E5H,0.31,-0.005,-1.587,2760.7,7885.8,0.31,0.315,8226.8,0.315,0.31,866624.5,AGR
9,HongkongLand USD,H78,7.17,-0.01,-0.139,411.7,1.0,7.17,7.18,10.2,7.19,7.15,2951162.0,PROP
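When the encoding of a file is not known in advance, one option is to try a short list of candidates in order.  The sketch below is only an illustration: the helper name read_csv_with_fallback and the small demo file are assumptions made for this example and are not part of the lesson's data.  ISO-8859-1 is placed last because it can decode any byte, so it never raises a UnicodeDecodeError.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "ISO-8859-1")):
    """Hypothetical helper: try each encoding in turn, return (DataFrame, encoding)."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the file, try the next one
    raise ValueError(f"none of {encodings} could decode {path}")


# Made-up demo file: the byte 0xa0 on its own is not valid UTF-8
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(b"Counter_Name,Last\xa0\nOCBC Bank,11.68\n")
    demo_path = tmp.name

demo_df, used_encoding = read_csv_with_fallback(demo_path)
os.remove(demo_path)

print(used_encoding)
print(demo_df.shape)
```

A fallback of this kind only guarantees that the file can be read, not that the characters are interpreted correctly, so it is still worth inspecting the result.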


In [4]:
type(stock)

pandas.core.frame.DataFrame

We look at the type of data in each column.  Columns of type object hold strings, whilst float64 columns are numeric.
Note that even though "Value" looks numeric in the CSV file, it is read in as strings in this case.
This is because of the commas (',') used as thousands separators.

We want to add up the "Value" of all 30 stocks.  What should we do?

In [2]:
stock.dtypes

Counter_Name      object
Code              object
Last              object
Chg               object
PercentChange     object
Vol               object
B Vol             object
Buy              float64
Sell             float64
S Vol             object
High             float64
Low              float64
Value             object
Sector            object
dtype: object

The .shape attribute indicates that the stock DataFrame has 31 rows and 14 columns.

In [3]:
stock.shape

(31, 14)

We can access specific rows with the iloc[] indexer.  Below we look at the second row.

iloc[] selects rows by integer position (as opposed to columns or index labels), counting from 0.  So iloc[1] is the second row.

In [15]:
stock.iloc[1]

Counter_Name     Ascendas Reit
Code                      A17U
Last                      2.63
Chg                       0.01
PercentChange            0.382
Vol                   2,382.50
B Vol                    425.5
Buy                       2.62
Sell                      2.63
S Vol                    198.6
High                      2.64
Low                       2.61
Value             6,245,997.50
Sector                    PROP
Name: 1, dtype: object
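The difference between position and label is easier to see when the index labels are not 0, 1, 2.  The short sketch below uses a made-up three-row DataFrame with an arbitrary index (10, 20, 30), chosen purely for illustration, to contrast iloc[] (by position) with loc[] (by label).

```python
import pandas as pd

# Made-up data with a non-default index, for illustration only
demo = pd.DataFrame(
    {
        "Counter_Name": ["OCBC Bank", "Ascendas Reit", "CapitaLand"],
        "Last": [11.68, 2.63, 3.18],
    },
    index=[10, 20, 30],
)

second_row = demo.iloc[1]    # second row by position
same_row = demo.loc[20]      # the same row, selected by its index label
first_two = demo.iloc[0:2]   # a slice of positions returns a DataFrame
one_cell = demo.iloc[1, 0]   # row position 1, column position 0

print(second_row)
print(one_cell)
```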

The last row contains no data, so we get rid of it by creating a new DataFrame without it (it does not matter that the new DataFrame has the same name).

In [15]:
print(len(stock)) # before removing of NaN row: 31
stock = stock[stock['Counter_Name'].notnull()]
print(len(stock)) # After removing of NaN row: 30
stock

31
30


Unnamed: 0,Counter_Name,Code,Last,Chg,PercentChange,Vol,B Vol,Buy,Sell,S Vol,High,Low,Value,Sector
0,OCBC Bank,O39,11.68,-,-,2824.5,21.2,11.67,11.68,15.7,11.76,11.61,32993891.0,FIN
1,Ascendas Reit,A17U,2.63,0.01,0.382,2382.5,425.5,2.62,2.63,198.6,2.64,2.61,6245997.5,PROP
2,CapitaLand,C31,3.18,-0.02,-0.625,4501.2,569.0,3.18,3.19,722.7,3.21,3.17,14339453.0,PROP
3,CapitaMall Trust,C38U,2.02,-0.02,-0.98,5270.2,3126.3,2.02,2.03,1173.5,2.04,2.01,10677863.6,PROP
4,CityDev,C09,11.02,-0.07,-0.631,958.7,10.0,11.02,11.03,9.2,11.06,10.95,10562490.0,PROP
5,ComfortDelGro,C52,2.3,0.04,1.77,4902.9,286.3,2.3,2.31,654.3,2.32,2.25,11225389.0,TSC
6,DBS,D05,26.83,-0.11,-0.408,2993.6,4.9,26.83,26.84,3.8,27.2,26.8,80860990.0,FIN
7,Genting Sing,G13,1.22,-,-,19180.2,3408.1,1.21,1.22,3008.4,1.24,1.21,23447047.0,Hotels
8,Golden Agri-Res,E5H,0.31,-0.005,-1.587,2760.7,7885.8,0.31,0.315,8226.8,0.315,0.31,866624.5,AGR
9,HongkongLand USD,H78,7.17,-0.01,-0.139,411.7,1.0,7.17,7.18,10.2,7.19,7.15,2951162.0,PROP
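The same clean-up can also be written with the dropna() method.  The sketch below runs on a made-up DataFrame whose last row is empty; the data and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up data: the last row is entirely missing
demo = pd.DataFrame(
    {
        "Counter_Name": ["OCBC Bank", "DBS", np.nan],
        "Last": [11.68, 26.83, np.nan],
    }
)

# Drop rows where Counter_Name is missing (same effect as the notnull() filter)
by_column = demo.dropna(subset=["Counter_Name"])

# Drop only rows in which every column is missing
all_empty_removed = demo.dropna(how="all")

print(len(demo), len(by_column), len(all_empty_removed))
```

With how="all", a row that is only partly filled in is kept, which may or may not be what is wanted for a given file.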


In [16]:
stock['Value']

KeyError: 'Value'

After a number of tries at accessing the column and getting KeyError: 'Value' above, I decided to investigate how the column headings are actually represented.  The command stock.columns prints out the column names.  Surprise: eleven of the names, including 'Value ', have a trailing space.  This causes the error, as I had asked for "Value", which does not have the space.

In [21]:
stock.columns

Index(['Counter_Name', 'Code', 'Last ', 'Chg ', 'PercentChange', 'Vol ',
       'B Vol ', 'Buy ', 'Sell ', 'S Vol ', 'High ', 'Low ', 'Value ',
       'Sector '],
      dtype='object')

Rather than try to work with the whitespace, we run the command below to strip it from the column names.

Now we get the results we want.

In [18]:
stock.columns = stock.columns.str.strip()
stock.columns

Index(['Counter_Name', 'Code', 'Last', 'Chg', 'PercentChange', 'Vol', 'B Vol',
       'Buy', 'Sell', 'S Vol', 'High', 'Low', 'Value', 'Sector'],
      dtype='object')

In [23]:
stock['Value']

0        32,993,891
1      6,245,997.50
2        14,339,453
3     10,677,863.60
4        10,562,490
5        11,225,389
6        80,860,990
7        23,447,047
8        866,624.50
9         2,951,162
10     5,028,804.50
11        2,462,513
12        4,810,851
13        3,136,015
14       22,923,951
15       12,568,100
16        3,782,204
17        8,244,786
18        6,948,203
19        4,726,874
20       37,194,598
21        2,809,424
22        8,198,752
23       11,191,367
24     5,522,935.40
25       34,787,502
26        5,204,928
27       19,224,610
28       10,100,631
29       16,505,420
Name: Value, dtype: object

Note that 'Value' is not numeric and cannot be summed.

In [24]:
sum(stock['Value'])

TypeError: unsupported operand type(s) for +: 'int' and 'str'

We try the same approach as in the previous lesson on reading text files to convert the strings to float.  However, it does not work this time: float() converts a single value, whereas here we are passing it the entire Series in one command.

In [25]:
stock['ValueFloat']=float(stock['Value'].replace(',',''))

TypeError: cannot convert the series to <class 'float'>
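There are two separate issues in the failed command.  float() accepts one value, not a Series, and Series.replace(',', '') only replaces entries that are exactly ',' rather than commas inside longer strings.  The sketch below, on two made-up values, shows the difference and one vectorised alternative using the .str accessor and astype(); it is offered as an option alongside the list comprehension, not as a replacement for it.

```python
import pandas as pd

# Two made-up values in the same format as the "Value" column
values = pd.Series(["32,993,891", "6,245,997.50"])

# Series.replace matches whole entries, so nothing changes here
whole_entry = values.replace(",", "")

# .str.replace works on the text inside each entry
without_commas = values.str.replace(",", "", regex=False)

# astype converts the whole Series in one step
as_float = without_commas.astype(float)

print(whole_entry.tolist())
print(as_float.tolist())
```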

We use a list comprehension to remove the commas (it is similar to a loop, but much neater).

In [19]:
stock['ValueFloat']=[x.replace(',','') for x in stock['Value']]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
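The message above is a pandas warning rather than an error, and the new column is still created.  It most likely appears because stock was produced by filtering another DataFrame.  One common way to avoid it, sketched here on made-up rows, is to take an explicit .copy() of the filtered result before adding columns.  The .copy() call is an addition for this sketch and is not in the cells above.

```python
import pandas as pd

# Made-up data: the last row is empty, as in the lesson's file
demo = pd.DataFrame(
    {
        "Counter_Name": ["OCBC Bank", "DBS", None],
        "Value": ["32,993,891", "80,860,990", None],
    }
)

# .copy() makes the filtered result an independent DataFrame
filtered = demo[demo["Counter_Name"].notnull()].copy()

# Adding a column now modifies only the copy
filtered["ValueFloat"] = [float(x.replace(",", "")) for x in filtered["Value"]]

print(filtered)
```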


In [20]:
stock['ValueFloat']

0        32993891
1      6245997.50
2        14339453
3     10677863.60
4        10562490
5        11225389
6        80860990
7        23447047
8       866624.50
9         2951162
10     5028804.50
11        2462513
12        4810851
13        3136015
14       22923951
15       12568100
16        3782204
17        8244786
18        6948203
19        4726874
20       37194598
21        2809424
22        8198752
23       11191367
24     5522935.40
25       34787502
26        5204928
27       19224610
28       10100631
29       16505420
Name: ValueFloat, dtype: object

We then convert 'ValueFloat' from string to float, again using a list comprehension.

In [27]:
stock['ValueFloat'] = [float(x) for x in stock['ValueFloat']]
stock['ValueFloat']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


0     32993891.0
1      6245997.5
2     14339453.0
3     10677863.6
4     10562490.0
5     11225389.0
6     80860990.0
7     23447047.0
8       866624.5
9      2951162.0
10     5028804.5
11     2462513.0
12     4810851.0
13     3136015.0
14    22923951.0
15    12568100.0
16     3782204.0
17     8244786.0
18     6948203.0
19     4726874.0
20    37194598.0
21     2809424.0
22     8198752.0
23    11191367.0
24     5522935.4
25    34787502.0
26     5204928.0
27    19224610.0
28    10100631.0
29    16505420.0
Name: ValueFloat, dtype: float64
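As an aside, pd.read_csv() has a thousands parameter and an na_values parameter that can handle this kind of formatting while the file is being read.  The sketch below uses a small in-memory CSV with made-up rows; whether it works unchanged on STI_Index.csv depends on details of that file (such as stray spaces) that are not verified here.

```python
import io

import pandas as pd

# Made-up CSV text: quoted values with thousands separators, "-" for no change
csv_text = (
    "Counter_Name,Chg,Value\n"
    'OCBC Bank,-,"32,993,891"\n'
    'Ascendas Reit,0.01,"6,245,997.50"\n'
)

demo = pd.read_csv(
    io.StringIO(csv_text),
    thousands=",",   # treat "," inside numbers as a thousands separator
    na_values="-",   # treat "-" as a missing value
)

print(demo.dtypes)
print(demo["Value"].sum())
```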

We can finally sum up the market value, with the appropriate format included.

In [28]:
print('{} {:0,.0f}'.format('$',sum(stock['ValueFloat'])))

$ 419,543,376
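The same total can be obtained with the Series .sum() method and formatted with an f-string.  The sketch below uses three made-up values rather than the full column, so its total differs from the one above.

```python
import pandas as pd

# Three made-up values, for illustration only
value_float = pd.Series([32993891.0, 14339453.0, 10562490.0])

total = value_float.sum()      # Series.sum() adds up the whole column
formatted = f"$ {total:,.0f}"  # "," adds thousands separators, ".0f" drops decimals

print(formatted)
```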


## 2.0  Writing CSV File

It is a simple procedure to use the .to_csv() method to write the whole DataFrame to a CSV file, as illustrated below.  We pass "index=False" to prevent the row index (0, 1, 2, ...) from being written as the first column of the CSV file.

In [29]:
stock.to_csv('MarketValue.csv', index=False)

If we only want to write out specific columns, we must specify them using the columns parameter.

In [33]:
stock.to_csv('MarketValueSelected.csv', index=False, columns = ["Counter_Name", "ValueFloat"])
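A quick way to check what to_csv() produces is to write to a string instead of a file and read the result back.  The sketch below does this on a made-up two-row DataFrame; when no path is given, to_csv() returns the CSV text.

```python
import io

import pandas as pd

# Made-up data, for illustration only
demo = pd.DataFrame(
    {
        "Counter_Name": ["OCBC Bank", "DBS"],
        "ValueFloat": [32993891.0, 80860990.0],
        "Sector": ["FIN", "FIN"],
    }
)

# No path given, so the CSV text is returned as a string
csv_text = demo.to_csv(index=False, columns=["Counter_Name", "ValueFloat"])

# Read the text back to confirm that only the selected columns were written
round_trip = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(list(round_trip.columns))
```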

We can also write selected lines individually into the CSV file instead of writing the whole DataFrame.

Note that each line ends with the newline character '\n'.

This makes sure that the next line of data is written on a new line.

In [35]:
with open('MarketValueLine.csv',"w") as fileout:
    fileout.write(','.join(map(str,stock.columns))+'\n')
    for i in range(len(stock)):
        fileout.write(','.join(map(str,stock.iloc[i]))+'\n')
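One caveat with joining values by hand: a value that itself contains a comma, such as the original "Value" strings, is split into several fields when the file is read back.  The sketch below, using one made-up row, compares the hand-joined line with the line produced by csv.writer from the standard library, which adds quotes where they are needed.

```python
import csv
import io

# One made-up row; the middle value contains commas
row = ["OCBC Bank", "32,993,891", "FIN"]

# Joining by hand gives a line with five comma-separated pieces
naive_line = ",".join(row)

# csv.writer quotes the field that contains commas
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
quoted_line = buffer.getvalue().strip()

# Parse both lines back to count the fields
naive_fields = next(csv.reader([naive_line]))
quoted_fields = next(csv.reader([quoted_line]))

print(naive_line, len(naive_fields))
print(quoted_line, len(quoted_fields))
```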

In [36]:
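import csv
# writerows() expects a sequence of rows, each row being a sequence of values
with open('MarketValueCSVWriter.csv', 'w', newline='') as fileout:  # newline='' lets csv control line endings
    writer = csv.writer(fileout)
    writer.writerow(stock.columns)           # header row
    writer.writerows(stock.values.tolist())  # one list per data row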