# Smart Read File
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

In [1]:
!pip install azureml-dataprep



In [2]:
import azureml.dataprep as dprep

DataPrep can load many different kinds of files. The `smart_read_file` entry point accepts a file in any supported format (text-based files as well as Excel, JSON and Parquet) and auto-detects how to parse it. It also attempts to infer the type of each column and applies the corresponding type conversions to the columns it detects.

The result is a `Dataflow` object containing all the steps required to read the given file(s) and convert their columns to the predicted types. No parameters are required beyond the file path or `FileDataSource` object.
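
To give a feel for what format auto-detection involves, here is a minimal sketch that inspects the first bytes of a file. It is not DataPrep's implementation: `sniff_format` is a hypothetical helper. The only facts it relies on are that Parquet files begin with the bytes `PAR1`, that `.xlsx` files are zip archives beginning with `PK\x03\x04`, and that JSON documents usually begin with `{` or `[`.

```python
def sniff_format(head: bytes) -> str:
    """Guess a file's format from its leading bytes (hypothetical helper)."""
    if head.startswith(b"PAR1"):
        # Parquet files start with the magic bytes "PAR1".
        return "parquet"
    if head.startswith(b"PK\x03\x04"):
        # .xlsx workbooks are zip archives, which start with this signature.
        return "zip container (e.g. xlsx)"
    if head.lstrip()[:1] in (b"{", b"["):
        # JSON documents normally open with an object or an array.
        return "json"
    # Anything else is treated as plain text that still needs a separator
    # or column offsets to be worked out.
    return "delimited or fixed-width text"


print(sniff_format(b"PAR1\x15\x04"))
print(sniff_format(b'  {"ID": 10140490}'))
print(sniff_format(b"ID |CaseNumber| |Completed|"))
```

A real detector has to go further than this for text files, because the separator, header row and encoding all have to be guessed from the content.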

In [3]:
smart_dataflow = dprep.smart_read_file('./data/multiple_separators.csv')
smart_dataflow.head(10)

Unnamed: 0,ID,CaseNumber,Column3,Completed,Column5
0,10140490.0,HY329907,,Y,
1,10139776.0,HY329265,,Y,
2,10140270.0,HY329253,,N,
3,10139885.0,HY329308,,Y,
4,10140379.0,HY329556,,N,
5,10140868.0,HY330421,,N,
6,10139762.0,HY329232,,N,
7,10139722.0,HY329228,,Y,
8,10139774.0,HY329209,,N,
9,10139697.0,HY329177,,N,


Looking at the data, we can see that there is an empty column on each side of the 'Completed' column (`Column3` and `Column5`).
If we compare the dataframe to a few rows from the original file:
```
ID |CaseNumber| |Completed|
10140490 |HY329907| |Y|
10139776 |HY329265| |Y|
```
We can see that the `|` characters have disappeared in the dataframe. This is because `|` is a very common separator character in CSV files, so `smart_read_file` guessed it was the column separator. For this data we actually want the `|` characters to remain and to use a space as the column separator instead.
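
To see why `|` is a plausible guess for this file, consider a simple heuristic: pick the candidate character that appears the same number of times on every line, preferring the one that appears most often. The sketch below is only an illustration of that idea. `guess_separator` and its candidate list are assumptions made for this example, and DataPrep's actual detection logic may work differently.

```python
SAMPLE_LINES = [
    "ID |CaseNumber| |Completed|",
    "10140490 |HY329907| |Y|",
    "10139776 |HY329265| |Y|",
]


def guess_separator(lines, candidates=(",", "\t", ";", "|", " ")):
    """Return the candidate with a consistent, highest per-line count."""
    best, best_count = None, 0
    for sep in candidates:
        counts = {line.count(sep) for line in lines}
        if len(counts) != 1:
            continue  # the count differs between lines, so skip it
        (count,) = counts
        if count > best_count:
            best, best_count = sep, count
    return best


print(repr(guess_separator(SAMPLE_LINES)))
```

Both `|` (four per line) and the space (two per line) are consistent across the sample, so a heuristic of this kind needs a tie-breaker, and any tie-breaker can pick the separator we did not intend.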

To achieve this we can use `detect_file_format`, which takes a file path or data source object and returns a `FileFormatBuilder` that has learnt some information about the supplied data.
This is what `smart_read_file` uses behind the scenes to 'learn' the contents of the given file and determine how to parse it. With the `FileFormatBuilder` we can take advantage of the intelligent learning aspect of `smart_read_file` while still having the chance to modify some of the learnt information.

In [4]:
ffb = dprep.detect_file_format('./data/multiple_separators.csv')
ffb_2 = dprep.detect_file_format('./data/excel.xlsx')
ffb_3 = dprep.detect_file_format('./data/fixed_width_file.txt')
ffb_4 = dprep.detect_file_format('./data/json.json')

print(ffb.file_format)
print(ffb_2.file_format)
print(ffb_3.file_format)
print(type(ffb_4.file_format))

ParseDelimitedProperties
    separator: '|'
    headers_mode: PromoteHeadersMode.CONSTANTGROUPED
    encoding: FileEncoding.UTF8
    quoting: False
    skip_rows: 0
    skip_mode: SkipMode.NONE
    comment: None

ReadExcelProperties
    sheet_name: None
    use_headers: False
    skip_rows: 0

ParseFixedWidthProperties
    offsets: '[7, 13, 43, 46, 52, 58, 65, 73]'
    headers_mode: PromoteHeadersMode.NONE
    encoding: FileEncoding.UTF8
    skip_rows: 0
    skip_mode: SkipMode.NONE

<class 'azureml.dataprep.api.parseproperties.ReadJsonProperties'>
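
The `offsets` list in the fixed-width output is easier to picture with a small example. The sketch below assumes that each offset is the character position at which a new column starts. That reading is an assumption for illustration, not something this notebook states. The sample line and the `split_fixed_width` helper are hypothetical.

```python
def split_fixed_width(line, offsets):
    """Cut a line into fields at the given character offsets."""
    bounds = [0, *offsets, len(line)]
    # Each field runs from one boundary up to (not including) the next.
    return [line[start:end].strip() for start, end in zip(bounds, bounds[1:])]


# Build a line whose columns start at positions 0, 7 and 17.
line = "AB12".ljust(7) + "Widget".ljust(10) + "0042"

print(split_fixed_width(line, [7, 17]))
```

Under the same assumption, the eight offsets detected above would describe nine columns.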


After calling `detect_file_format` we get a `FileFormatBuilder` that has already had `learn` called on it. This means the `file_format` attribute is populated with a `<Parse|Read><type>Properties` object that contains all the information learnt about the file. As we can see above, each file type has a corresponding `file_format` detected.
Continuing with our delimited example, we can change any of these values and then call `ffb.to_dataflow()` to create a `Dataflow` that has the steps required to parse the data source.

In [5]:
ffb.file_format.separator = ' '
dataflow = ffb.to_dataflow()
df = dataflow.to_pandas_dataframe()
df

Unnamed: 0,ID,|CaseNumber|,|Completed|
0,10140490,|HY329907|,|Y|
1,10139776,|HY329265|,|Y|
2,10140270,|HY329253|,|N|
3,10139885,|HY329308|,|Y|
4,10140379,|HY329556|,|N|
5,10140868,|HY330421|,|N|
6,10139762,|HY329232|,|N|
7,10139722,|HY329228|,|Y|
8,10139774,|HY329209|,|N|
9,10139697,|HY329177|,|N|


The result is our desired dataframe with the `|` characters included.
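
The effect of changing the separator can be reproduced on the sample rows with Python's standard `csv` module. This is only a stand-in to show the difference between the two choices. It says nothing about how DataPrep parses the file internally.

```python
import csv

sample_lines = [
    "ID |CaseNumber| |Completed|",
    "10140490 |HY329907| |Y|",
    "10139776 |HY329265| |Y|",
]

# Splitting on '|' consumes the pipes and leaves two near-empty columns.
pipe_rows = list(csv.reader(sample_lines, delimiter="|"))

# Splitting on a space keeps the pipes as part of the values.
space_rows = list(csv.reader(sample_lines, delimiter=" "))

print(pipe_rows[1])
print(space_rows[1])
```

The first parse yields five fields per row, two of which carry no data, which matches the `Column3` and `Column5` columns seen earlier. The second yields the three columns we want.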

If we refer back to the original data output by `smart_read_file`, the 'ID' column was also detected as numeric and converted to a number data type, whereas in the data above it remains a string.
We can perform type inference on our new dataflow using the `dataflow.builders` property. This property exposes different builders that can `learn` from a dataflow and `apply` the learning to produce a new dataflow, very similar to the pattern we used above for the `FileFormatBuilder`.

In [6]:
ctb = dataflow.builders.set_column_types()
ctb.learn()
ctb.inference_info

{'|CaseNumber|': [FieldType.STRING],
 '|Completed|': [FieldType.STRING],
 'ID': [FieldType.DECIMAL]}

After learning, `ctb.inference_info` is populated with the inferred types for each column. A column can have multiple candidate types, but in this example there is only one type for each column.
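
As a rough illustration of how a column can end up with one or several candidates, the sketch below tries each parser on every value and keeps the types that succeed for the whole column. Everything in it is an assumption made for this example: the `infer_candidates` helper, the type labels, the two date formats, and the `Opened` column, which does not exist in the sample file and is added only to show an ambiguous case.

```python
from datetime import datetime

DATE_FORMATS = ("%m/%d/%Y", "%d/%m/%Y")


def _parses(parser, text):
    """Return True if parser(text) succeeds without a ValueError."""
    try:
        parser(text)
        return True
    except ValueError:
        return False


def infer_candidates(values):
    """List every type label that all values in the column satisfy."""
    if not values:
        return ["STRING"]
    candidates = []
    if all(_parses(float, v) for v in values):
        candidates.append("DECIMAL")
    for fmt in DATE_FORMATS:
        if all(_parses(lambda v, f=fmt: datetime.strptime(v, f), v) for v in values):
            candidates.append("DATE " + fmt)
    # Fall back to STRING when nothing more specific fits.
    return candidates or ["STRING"]


columns = {
    "ID": ["10140490", "10139776", "10140270"],
    "|CaseNumber|": ["|HY329907|", "|HY329265|", "|HY329253|"],
    "|Completed|": ["|Y|", "|Y|", "|N|"],
    "Opened": ["01/02/2019", "03/04/2019"],  # hypothetical, ambiguous dates
}

inference = {name: infer_candidates(vals) for name, vals in columns.items()}
for name, found in inference.items():
    print(name, found)
```

The hypothetical `Opened` column parses under both date formats, so it keeps two candidates, and someone has to choose between them before the conversion can be applied.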

The candidates look correct: we only want to convert `ID` to a number column (also known as `DECIMAL`). Applying this `ColumnTypesBuilder` should therefore result in a `Dataflow` with our columns converted to their respective types.

In [7]:
converted_dataflow = ctb.to_dataflow()
converted_df = converted_dataflow.to_pandas_dataframe()
converted_df

Unnamed: 0,ID,|CaseNumber|,|Completed|
0,10140490.0,|HY329907|,|Y|
1,10139776.0,|HY329265|,|Y|
2,10140270.0,|HY329253|,|N|
3,10139885.0,|HY329308|,|Y|
4,10140379.0,|HY329556|,|N|
5,10140868.0,|HY330421|,|N|
6,10139762.0,|HY329232|,|N|
7,10139722.0,|HY329228|,|Y|
8,10139774.0,|HY329209|,|N|
9,10139697.0,|HY329177|,|N|
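
The 'apply' half of the pattern amounts to running a converter over each column that has a chosen type. A minimal plain-Python sketch follows, with a hypothetical `apply_types` helper standing in for what the builder does. It is an illustration of the idea rather than DataPrep's implementation.

```python
rows = [
    {"ID": "10140490", "|CaseNumber|": "|HY329907|", "|Completed|": "|Y|"},
    {"ID": "10139776", "|CaseNumber|": "|HY329265|", "|Completed|": "|Y|"},
]

# Only ID has a non-string type chosen for it.
chosen_types = {"ID": float}


def apply_types(records, converters):
    """Convert each value with its column's converter, defaulting to str."""
    return [
        {name: converters.get(name, str)(value) for name, value in record.items()}
        for record in records
    ]


converted = apply_types(rows, chosen_types)
print(converted[0])
```

As in the dataframe above, `ID` becomes a floating-point number while the other two columns remain strings.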
