# Play with Data

- ### [What is Data?]()
- ### [Loading data in Python]()
- ### [How can we represent it?]()

#### What is Data?

- Structured (Log information, Database Entries).
- Unstructured Data.

### Data Representations
- CSVs
- Matrices

### Data Loading in Python

```
file = open(filename, fileflag)  # fileflag is the mode, e.g. 'r' to read
file_data = file.read()          # Read the contents
file.close()                     # Close the file
```
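
A `with` statement is a common way to follow the same open, read, close pattern without closing the file by hand. The snippet below is a minimal sketch: the file name `sample.csv` and its two rows are made up here so that it runs on its own.

```python
import os
import tempfile

# Create a small throwaway CSV so the example is self-contained.
# The file name and its contents are assumptions for illustration.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(sample_path, "w") as sample_file:
    sample_file.write("Loan_ID,Gender\nLP001015,Male")

# 'r' opens the file for reading. The file is closed automatically
# when the with-block ends, even if an error occurs inside it.
with open(sample_path, "r") as sample_file:
    sample_data = sample_file.read()

print(sample_file.closed)  # True
print(repr(sample_data))
```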

In [1]:
# Load the CSV file: Data/test.csv.
file = open("Data/test.csv", 'r')
file_data = file.read()
# Close the file once its contents have been read.
file.close()

In [2]:
# Display the file string representation.
file_data

'Loan_ID,Gender,Married,Education,ApplicantIncome,LoanAmount,Property_Area\nLP001015,Male,Yes,Graduate,5720,110,Urban\nLP001022,Male,Yes,Graduate,3076,126,Urban\nLP001031,Male,Yes,Graduate,5000,208,Urban\nLP001035,Male,Yes,Graduate,2340,100,Urban\nLP001051,Male,No,Not Graduate,3276,78,Urban\nLP001054,Male,Yes,Not Graduate,2165,152,Urban\nLP001055,Female,No,Not Graduate,2226,59,Semiurban\nLP001056,Male,Yes,Not Graduate,3881,147,Rural\nLP001059,Male,Yes,Graduate,13633,280,Urban\nLP001067,Male,No,Not Graduate,2400,123,Semiurban\nLP001078,Male,No,Not Graduate,3091,90,Urban\nLP001082,Male,Yes,Graduate,2185,162,Semiurban\nLP001083,Male,No,Graduate,4166,40,Urban\nLP001094,Male,Yes,Graduate,12173,166,Semiurban\nLP001096,Female,No,Graduate,4666,124,Semiurban\nLP001099,Male,No,Graduate,5667,131,Urban'

#### Data
As we can see, the file comes back as one long string that is hard to read. This is because the newline character `'\n'` is displayed as an escape sequence instead of starting a new line. That character marks where each row ends, so we just need to split the string at every newline to recover the individual lines.
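
To see the difference between how a string is displayed and what it contains, the short sketch below compares `repr()` and `print()`. The variable `sample` is a two-line stand-in for the file contents, not the real file.

```python
# A short sample in the same shape as the loan data (two lines only).
sample = "Loan_ID,Gender\nLP001015,Male"

# repr() shows the newline as the two-character escape sequence \n,
# which is what a notebook displays for a bare expression.
print(repr(sample))

# print() interprets the newline and starts a new line there.
print(sample)

# Either way, the string holds exactly one real newline character.
newline_count = sample.count("\n")
print(newline_count)  # 1
```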

### Representing in a List

In [3]:
# Split lines into list.
file_data_lines = file_data.split('\n')
file_data_lines

['Loan_ID,Gender,Married,Education,ApplicantIncome,LoanAmount,Property_Area',
 'LP001015,Male,Yes,Graduate,5720,110,Urban',
 'LP001022,Male,Yes,Graduate,3076,126,Urban',
 'LP001031,Male,Yes,Graduate,5000,208,Urban',
 'LP001035,Male,Yes,Graduate,2340,100,Urban',
 'LP001051,Male,No,Not Graduate,3276,78,Urban',
 'LP001054,Male,Yes,Not Graduate,2165,152,Urban',
 'LP001055,Female,No,Not Graduate,2226,59,Semiurban',
 'LP001056,Male,Yes,Not Graduate,3881,147,Rural',
 'LP001059,Male,Yes,Graduate,13633,280,Urban',
 'LP001067,Male,No,Not Graduate,2400,123,Semiurban',
 'LP001078,Male,No,Not Graduate,3091,90,Urban',
 'LP001082,Male,Yes,Graduate,2185,162,Semiurban',
 'LP001083,Male,No,Graduate,4166,40,Urban',
 'LP001094,Male,Yes,Graduate,12173,166,Semiurban',
 'LP001096,Female,No,Graduate,4666,124,Semiurban',
 'LP001099,Male,No,Graduate,5667,131,Urban']
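
Splitting on `'\n'` works here because the file does not end with a newline. Many files do, and in that case `split('\n')` leaves an empty string at the end of the list. The sketch below, using a made-up two-row `sample`, shows how `str.splitlines()` avoids that.

```python
# Two rows in the shape of the loan data, with a trailing newline.
sample = "Loan_ID,Gender\nLP001015,Male\n"

with_split = sample.split("\n")
with_splitlines = sample.splitlines()

print(with_split)       # ['Loan_ID,Gender', 'LP001015,Male', '']
print(with_splitlines)  # ['Loan_ID,Gender', 'LP001015,Male']
```

`splitlines()` also handles Windows-style `\r\n` line endings, which is one reason it may be the safer choice for files of unknown origin.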

In [4]:
# Splitting individual lines.

# The original string representation.
print(file_data_lines[1])
# The String split into different columns.
file_data_lines[1].split(',')

LP001015,Male,Yes,Graduate,5720,110,Urban


['LP001015', 'Male', 'Yes', 'Graduate', '5720', '110', 'Urban']

#### Creating a List of Lists
We need to be able to access every cell value. To do that, we split each row string in the list into a list of cell values.

In [5]:
# Create the final cleaned list.
cleaned_file = []
# Loop to iterate and process each line.
for line in file_data_lines:
    processed_line = line.split(',')
    cleaned_file.append(processed_line)

In [6]:
# Display the cleaned list.
cleaned_file

[['Loan_ID',
  'Gender',
  'Married',
  'Education',
  'ApplicantIncome',
  'LoanAmount',
  'Property_Area'],
 ['LP001015', 'Male', 'Yes', 'Graduate', '5720', '110', 'Urban'],
 ['LP001022', 'Male', 'Yes', 'Graduate', '3076', '126', 'Urban'],
 ['LP001031', 'Male', 'Yes', 'Graduate', '5000', '208', 'Urban'],
 ['LP001035', 'Male', 'Yes', 'Graduate', '2340', '100', 'Urban'],
 ['LP001051', 'Male', 'No', 'Not Graduate', '3276', '78', 'Urban'],
 ['LP001054', 'Male', 'Yes', 'Not Graduate', '2165', '152', 'Urban'],
 ['LP001055', 'Female', 'No', 'Not Graduate', '2226', '59', 'Semiurban'],
 ['LP001056', 'Male', 'Yes', 'Not Graduate', '3881', '147', 'Rural'],
 ['LP001059', 'Male', 'Yes', 'Graduate', '13633', '280', 'Urban'],
 ['LP001067', 'Male', 'No', 'Not Graduate', '2400', '123', 'Semiurban'],
 ['LP001078', 'Male', 'No', 'Not Graduate', '3091', '90', 'Urban'],
 ['LP001082', 'Male', 'Yes', 'Graduate', '2185', '162', 'Semiurban'],
 ['LP001083', 'Male', 'No', 'Graduate', '4166', '40', 'Urban'],
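 ['LP001094', 'Male', 'Yes', 'Graduate', '12173', '166', 'Semiurban'],
 ['LP001096', 'Female', 'No', 'Graduate', '4666', '124', 'Semiurban'],
 ['LP001099', 'Male', 'No', 'Graduate', '5667', '131', 'Urban']]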
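
Splitting on `','` is enough for this file because no value contains a comma. When a value does (CSV files wrap such values in quotes), a plain split breaks the row apart. The sketch below uses Python's standard `csv` module as an alternative. The `sample` string, including its "Graduate, Honours" value, is hypothetical and only there to show the difference.

```python
import csv
import io

# Hypothetical sample: the Education value contains a comma inside quotes.
sample = 'Loan_ID,Education,LoanAmount\nLP001015,"Graduate, Honours",110'

# The manual approach: split into lines, then split each line on commas.
naive_rows = [line.split(",") for line in sample.split("\n")]

# io.StringIO lets csv.reader treat the string like an open file.
csv_rows = list(csv.reader(io.StringIO(sample)))

print(naive_rows[1])  # ['LP001015', '"Graduate', ' Honours"', '110']
print(csv_rows[1])    # ['LP001015', 'Graduate, Honours', '110']
```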

#### Accessing Data
Now that we have represented our data as a 2D matrix (a list of lists), it is easy to access individual elements:

```
matrix[row][column]
```

In [7]:
# Accessing LP001054 in the Loan_ID column.
# Access a 2D matrix as matrix[row][column].
cleaned_file[6][0]

'LP001054'
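
The same indexing can pull out a whole column. The sketch below uses a cut-down sample (three rows and three columns taken from the data above) and a list comprehension. The names `sample_rows` and `income_index` are illustrative only.

```python
# A cut-down sample: header plus the first three data rows.
sample_rows = [
    ["Loan_ID", "Gender", "ApplicantIncome"],
    ["LP001015", "Male", "5720"],
    ["LP001022", "Male", "3076"],
    ["LP001031", "Male", "5000"],
]

# Look up the column position from the header instead of hard-coding it.
header = sample_rows[0]
income_index = header.index("ApplicantIncome")

# Skip the header row (index 0) and pick the same column from every row.
incomes = [row[income_index] for row in sample_rows[1:]]
print(incomes)  # ['5720', '3076', '5000']
```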

### Representing in a Dictionary

The data can be represented in a Python dictionary as follows:
```
data['column_name'] = [List of values in that particular column]
```
To do that, we create a dictionary whose keys are the column names and whose values are lists. Then we iterate over the rows that contain the data and append each value in a row to the list for its column.

In [13]:
# Converting the data into a dictionary.
# Creating the dictionary with columns.
cols = cleaned_file[0]
loan_data_dict = dict.fromkeys(cols)

In [14]:
# Initialize the dictionary with empty lists.
for column in cols:
    loan_data_dict[column] = []

In [15]:
# Display the dictionary.
loan_data_dict

{'ApplicantIncome': [],
 'Education': [],
 'Gender': [],
 'LoanAmount': [],
 'Loan_ID': [],
 'Married': [],
 'Property_Area': []}
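
The separate initialization loop matters. It may be tempting to write `dict.fromkeys(cols, [])` instead, but that makes every key point at the same list object. The sketch below, with a shortened two-column `cols` list, shows the pitfall and one possible alternative, a dict comprehension.

```python
# Shortened column list for illustration.
cols = ["Loan_ID", "Gender"]

# Pitfall: every key refers to the SAME list object.
shared = dict.fromkeys(cols, [])
shared["Loan_ID"].append("LP001015")
print(shared)  # {'Loan_ID': ['LP001015'], 'Gender': ['LP001015']}

# A dict comprehension builds a fresh list for each key.
separate = {column: [] for column in cols}
separate["Loan_ID"].append("LP001015")
print(separate)  # {'Loan_ID': ['LP001015'], 'Gender': []}
```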

In [16]:
# Append the values to the respective columns.
for row in range(1, len(cleaned_file)):
    loan_data_dict['Loan_ID'].append(cleaned_file[row][0])
    loan_data_dict['Gender'].append(cleaned_file[row][1])
    loan_data_dict['Married'].append(cleaned_file[row][2])
    loan_data_dict['Education'].append(cleaned_file[row][3])
    loan_data_dict['ApplicantIncome'].append(cleaned_file[row][4])
    loan_data_dict['LoanAmount'].append(cleaned_file[row][5])
    loan_data_dict['Property_Area'].append(cleaned_file[row][6])
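
Writing one `append` per column is explicit, but it has to be edited whenever the columns change. As a sketch of a more general version, the loop below pairs column names with row values using `zip`. It runs on a small three-column sample (`sample_rows` and `data_dict` are illustrative names) rather than the full file.

```python
# Small sample: header plus two data rows from the loan data.
sample_rows = [
    ["Loan_ID", "Gender", "Married"],
    ["LP001015", "Male", "Yes"],
    ["LP001051", "Male", "No"],
]

cols = sample_rows[0]
data_dict = {column: [] for column in cols}

for row in sample_rows[1:]:
    # zip pairs each column name with the value at the same position.
    for column, value in zip(cols, row):
        data_dict[column].append(value)

print(data_dict["Loan_ID"])  # ['LP001015', 'LP001051']
print(data_dict["Married"])  # ['Yes', 'No']
```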

In [17]:
loan_data_dict

{'ApplicantIncome': ['5720',
  '3076',
  '5000',
  '2340',
  '3276',
  '2165',
  '2226',
  '3881',
  '13633',
  '2400',
  '3091',
  '2185',
  '4166',
  '12173',
  '4666',
  '5667'],
 'Education': ['Graduate',
  'Graduate',
  'Graduate',
  'Graduate',
  'Not Graduate',
  'Not Graduate',
  'Not Graduate',
  'Not Graduate',
  'Graduate',
  'Not Graduate',
  'Not Graduate',
  'Graduate',
  'Graduate',
  'Graduate',
  'Graduate',
  'Graduate'],
 'Gender': ['Male',
  'Male',
  'Male',
  'Male',
  'Male',
  'Male',
  'Female',
  'Male',
  'Male',
  'Male',
  'Male',
  'Male',
  'Male',
  'Male',
  'Female',
  'Male'],
 'LoanAmount': ['110',
  '126',
  '208',
  '100',
  '78',
  '152',
  '59',
  '147',
  '280',
  '123',
  '90',
  '162',
  '40',
  '166',
  '124',
  '131'],
 'Loan_ID': ['LP001015',
  'LP001022',
  'LP001031',
  'LP001035',
  'LP001051',
  'LP001054',
  'LP001055',
  'LP001056',
  'LP001059',
  'LP001067',
  'LP001078',
  'LP001082',
  'LP001083',
  'LP001094',
  'LP001096',
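  'LP001099'],
 'Married': ['Yes',
  'Yes',
  'Yes',
  'Yes',
  'No',
  'Yes',
  'No',
  'Yes',
  'Yes',
  'No',
  'No',
  'Yes',
  'No',
  'Yes',
  'No',
  'No'],
 'Property_Area': ['Urban',
  'Urban',
  'Urban',
  'Urban',
  'Urban',
  'Urban',
  'Semiurban',
  'Rural',
  'Urban',
  'Semiurban',
  'Urban',
  'Semiurban',
  'Urban',
  'Semiurban',
  'Semiurban',
  'Urban']}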

In [20]:
# Accessing just the Loan IDs.
loan_data_dict['Loan_ID']

['LP001015',
 'LP001022',
 'LP001031',
 'LP001035',
 'LP001051',
 'LP001054',
 'LP001055',
 'LP001056',
 'LP001059',
 'LP001067',
 'LP001078',
 'LP001082',
 'LP001083',
 'LP001094',
 'LP001096',
 'LP001099']
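
Every value read from the file is still a string, including the numeric columns. Before doing arithmetic, those values need to be converted. The sketch below does this for `LoanAmount` on a three-row sample dictionary (`sample_dict` is an illustrative name) and computes the average.

```python
# Three-row sample in the column-oriented dictionary layout.
sample_dict = {
    "Loan_ID": ["LP001015", "LP001022", "LP001031"],
    "LoanAmount": ["110", "126", "208"],
}

# Convert the string values to integers before doing arithmetic.
loan_amounts = [int(value) for value in sample_dict["LoanAmount"]]
average_amount = sum(loan_amounts) / len(loan_amounts)

print(loan_amounts)    # [110, 126, 208]
print(average_amount)  # 148.0
```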

### In conclusion
We used lists and dictionaries here to give you a fundamental understanding of how data can be loaded and processed. However, for large datasets we need a better and faster framework, because explicit Python loops add significant overhead. Next, we will look at NumPy and Pandas and how they let us load and process data easily and efficiently.

This free course on [Udacity](https://classroom.udacity.com/courses/ud170-india) can help you understand data processing in greater detail.