# Introduction to importing data with Python

1. **Variety of Data Sources**:
   - Flat files like .txt and .csv.
   - Files native to other software, e.g., Excel spreadsheets and files from Stata, SAS, and MATLAB.
   - Relational databases including SQLite and PostgreSQL.

2. **Classification of Text Files**:
   - Plain text files, for example, literary works like excerpts from Mark Twain's "The Adventures of Huckleberry Finn".
   - Structured data text files like "titanic.csv", where each row represents an individual (such as a passenger) and each column denotes attributes like gender, cabin, and survival status.

3. **Reading and Handling Text Files**:
   - Basic file reading using Python's `open` function with mode 'r' for read-only access, ensuring you don’t accidentally modify the file.
   - Importance of closing the file connection after reading, using `file.close()`.
   - Best practices like using the context manager (`with` statement) to automatically handle file closing.
   - How to display the contents of a file in the console using `print()`.
   - Different file access modes, with a mention of mode 'w' for writing to files, though not focused on in this course.

4. **Enhanced File Management**:
   - Introduction to using context managers for optimal file management, ensuring files are closed post-operations.
   - Interactive exercises will include tasks like printing file contents and specific lines, useful for managing large files.

5. **Using NumPy for Efficient Data Handling**:
   - An introduction to the Python library NumPy to facilitate the import and management of numerical data from flat files.

These points provide a comprehensive overview of the necessary skills for importing and managing various data file types using Python, focusing on practical methods and best practices.

# Dealing with flat files

## 1- Reading a text file

In [16]:
file_name = "sample.txt"
# Open in read-only mode; equivalent to open(file_name, mode="r")
file = open(file_name, "r")
txt = file.read()

In [11]:
file.close()

In [12]:
txt

'Welcome to the phase 2 of data analysis! I am your trainer Zartashia Afzal.'

In [13]:
print(txt)

Welcome to the phase 2 of data analysis! I am your trainer Zartashia Afzal.


# Writing to a file

In [17]:
file_name = "sample.txt"
# Caution: mode 'w' creates the file, or empties it if it already exists
file = open(file_name, mode='w')
file.close()
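
A minimal sketch of what mode 'w' does, followed by reading the text back. The temporary directory and the file name `demo.txt` are assumptions for illustration, chosen so the example does not touch `sample.txt`.

```python
import os
import tempfile

# Hypothetical file in a temporary directory (not the course's sample.txt)
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Mode 'w' creates the file, or empties it if it already exists
file = open(path, mode="w")
file.write("first line\n")
file.close()

# Opening with 'w' a second time discards the previous contents
file = open(path, mode="w")
file.write("second line\n")
file.close()

# Read the file back to see what survived
file = open(path, mode="r")
contents = file.read()
file.close()
print(contents)
```

Only the text from the second write remains, which is why opening an existing file with 'w' should be done deliberately.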

# Context manager: the `with` statement

In [19]:
with open('sample.txt', 'r') as file:
    print(file.read())
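
A short sketch showing that the `with` statement closes the file on exit, so no explicit `file.close()` is needed. The temporary file `demo.txt` is a hypothetical stand-in for `sample.txt`.

```python
import os
import tempfile

# Hypothetical file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Create the file so that there is something to read
with open(path, "w") as file:
    file.write("hello\n")

with open(path, "r") as file:
    text = file.read()
    inside = file.closed   # the file is still open inside the block

outside = file.closed      # the file was closed when the block ended
print(inside, outside)
```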




**Practice**

-   Open the txt file as read-only.
-   Print the contents.
-   Check whether the file is closed.
-   Close the file.
-   Check again that the file is closed.

In [21]:
# Open a file: file
file = open("sample.txt", mode="r")

# Print it
print(file.read())

# Check whether file is closed
print(file.closed)

# Close file
file.close()

# Check again whether file is closed
print(file.closed)


False
True


### Magic commands

In [20]:
! ls

'ls' is not recognized as an internal or external command,
operable program or batch file.

Note: a leading `!` passes the command to the system shell. `ls` is a Unix command, so it fails in the default Windows shell, as shown above; `! dir` lists the directory contents there instead.


### Importing text files linewise
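
For large files it is often enough to read a few lines rather than the whole file. The sketch below uses `readline()` to fetch one line at a time and then iterates over the file object to collect every line. The temporary file and its three lines of text are made up for illustration.

```python
import os
import tempfile

# Hypothetical three-line file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as file:
    file.write("line one\nline two\nline three\n")

# readline() returns the next line, including its trailing newline
with open(path, "r") as file:
    first = file.readline()
    second = file.readline()

# Iterating over the file object yields one line per step
with open(path, "r") as file:
    lines = [line.rstrip("\n") for line in file]

print(first, second, lines)
```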

# Zen of Python

In [24]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


# Importing Data with NumPy

-   NumPy is essential for numerical data handling in Python.
-   Use `np.loadtxt()` or `np.genfromtxt()` to import numerical datasets such as MNIST.


In [2]:
import numpy as np

In [6]:
data = np.loadtxt('mnist.txt', delimiter=',')

# Introduction to Importing Flat Files with NumPy
- Brief overview of opening text files using Python's built-in `open` function.
- Introduction to using NumPy for importing numerical data into numpy arrays.
- Benefits of using numpy arrays: efficiency, speed, and clean data handling.

# Why Use NumPy?
- NumPy arrays as the Python standard for numerical data storage.
- Importance for other Python packages, like scikit-learn.
- Built-in functions in NumPy like `loadtxt` and `genfromtxt` facilitate efficient data import.

# Basic Import with NumPy
- How to import NumPy: `import numpy as np`
- Using `np.loadtxt` to load flat files: pass the filename and, usually, the delimiter.
- Setting the delimiter: the default is any whitespace, so comma-delimited files need `delimiter=','`.

# Customizing NumPy Import
- Optional parameters to better control data import:
  - `skiprows` to skip header rows, e.g., `skiprows=1`.
  - `usecols` to specify which columns to import, e.g., `usecols=[0, 2]`.
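
The options above can be sketched on a small in-memory file. `StringIO` stands in for a file on disk, and the column names and values are made up for illustration.

```python
from io import StringIO

import numpy as np

# Made-up comma-delimited data with one header row
text = "a,b,c\n1,2,3\n4,5,6\n"

# Skip the header row and keep only the first and third columns
data = np.loadtxt(StringIO(text), delimiter=",", skiprows=1, usecols=[0, 2])
print(data)
print(data.shape)
```

Without `skiprows=1`, the header text would have to be converted to floats and the import would fail.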

# Handling Different Data Types
- Using the `dtype` parameter to specify data types, e.g., `dtype=str` to import every value as a string.
- Challenges with mixed data types in flat files.
- Example of issues: Importing complex datasets like the Titanic dataset with both floats and strings.
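
A minimal sketch of both points, using made-up data in place of the Titanic file: `dtype=str` imports everything as text, while the default float dtype fails on a text column.

```python
from io import StringIO

import numpy as np

# Made-up data: one text column and one numeric column
text = "Name,Fare\nAllen,7.25\nBraund,71.28\n"

# dtype=str imports every field, including the header row, as a string
as_text = np.loadtxt(StringIO(text), delimiter=",", dtype=str)
print(as_text)

# With the default float dtype, the names cannot be converted
try:
    np.loadtxt(StringIO(text), delimiter=",", skiprows=1)
    failed = False
except ValueError as err:
    failed = True
    print("loadtxt failed:", err)
```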



# Importing Data with Pandas

Pandas provides a powerful DataFrame object, which is ideal for handling labeled data with columns of potentially different types. Here’s how to import data using pandas:

```
import pandas as pd
data = pd.read_csv('titanic.csv')
```

# Flat Files in Practice

Flat files like `.csv` are widely used for simple and straightforward data storage. They represent data in rows and columns, making them easy to process and analyze.
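
As a small illustration of rows and columns in a `.csv` file, the sketch below reads an in-memory flat file into a pandas DataFrame. The three rows are made up and only loosely modeled on the Titanic columns; `StringIO` stands in for a file on disk.

```python
from io import StringIO

import pandas as pd

# Made-up rows: each line is one passenger, each column one attribute
text = "Survived,Sex,Age\n0,male,22.0\n1,female,38.0\n1,female,26.0\n"

df = pd.read_csv(StringIO(text))

# Each column gets its own type
print(df.dtypes)
print(df.head())
```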

**Why is an object's type printed with `print(type(digits))` and not with `print(digits.type())`?**
 
 The way types are identified in Python reflects its design philosophy and structural decisions. In Python, the syntax `print(type(digits))` is used to determine the type of an object like `digits` for a few key reasons:

1. **Built-in `type()` Function**: Python uses the built-in function `type()` to return the type of an object. The syntax `type(object)` is consistent with other built-in functions like `len()` for length and `id()` for identity, which also follow the pattern of `function(argument)`. This consistency makes Python intuitive and predictable.

2. **Separation of Data and Operations**: Python objects such as integers, strings, and lists do not define a `type()` method of their own. Instead, the type is retrieved through the built-in `type()` function, or through the `__class__` attribute that every object carries. Keeping this lookup in one place avoids repeating the same method definition across every type.

3. **Python’s Dynamic Typing**: Python is a dynamically typed language, meaning that the type of a variable is determined at runtime. This dynamic nature is why a separate function like `type()` is more suitable. It can be applied to any variable at any point in its lifecycle, regardless of its current type, enhancing flexibility.

4. **Object-Oriented Approach**: The syntax `print(digits.type())` implies that `type` would be a method specific to the `digits` object. However, in Python, not every object needs to define its own method to reveal its type; instead, the universal `type()` function queries an object for its type. This approach adheres to Python's philosophy of simple, readable code that can be universally applied.

In essence, `type(object)` as a function call is an example of Python's approach to providing utilities that are general and can be used across all types of objects, making the language easy to learn and use while maintaining a clear and consistent syntax.
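
A short sketch of the difference, assuming `digits` is a NumPy array (the values here are made up):

```python
import numpy as np

digits = np.array([0, 1, 2])

# The built-in function works on any object
print(type(digits))

# Every object also carries its type in the __class__ attribute
same = digits.__class__ is type(digits)

# Arrays do not define a type() method, so this call raises AttributeError
try:
    digits.type()
    has_type_method = True
except AttributeError:
    has_type_method = False

print(same, has_type_method)
```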

-   Complete the first call to `np.loadtxt()` by passing `file` as the first argument.
-   Execute `print(data[0])` to print the first element of `data`.
-   Complete the second call to `np.loadtxt()`. The file you're importing is tab-delimited, the datatype is float, and you want to skip the first row.
-   Print the 10th element of `data_float` by completing the `print()` command. Be guided by the previous `print()` call.
-   Execute the rest of the code to visualize the data.

In [10]:
import matplotlib.pyplot as plt

In [None]:
# Assign filename: file
file = 'seaslug.txt'

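# Import file as strings: data
# (dtype=str is an assumption: the first row is taken to be a text header)
data = np.loadtxt(file, delimiter='\t', dtype=str)

# Print the first element of data
print(data[0])

# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)

# Print the 10th element of data_float
print(data_float[9])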

# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()

**`plt.scatter(data_float[:, 0], data_float[:, 1])`**:
- `data_float[:, 0]` indexes the two-dimensional array `data_float`: `[:, 0]` means "select all rows of the first column". These values are the x-axis data for the scatterplot.
- `data_float[:, 1]` likewise selects all rows of the second column. These values are the y-axis data.

The code takes a dataset stored as a 2D array (`data_float`) and plots the first column against the second as points in a scatterplot. Each point represents one row of the dataset; its first-column value sets the x position and its second-column value sets the y position.
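
The column indexing can be checked on a small array. The values below are made up and are not taken from `seaslug.txt`.

```python
import numpy as np

# Made-up 2D array: three rows, two columns
data_float = np.array([[0.0, 10.0],
                       [1.0, 20.0],
                       [2.0, 30.0]])

x = data_float[:, 0]   # all rows, first column
y = data_float[:, 1]   # all rows, second column
row = data_float[1]    # a single index selects a whole row

print(x, y, row)
```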



# Working with mixed datatypes
Much of the time you will need to import datasets that have different datatypes in different columns; one column may contain strings and another floats, for example. With its default settings, `np.loadtxt()` cannot handle this and raises an error. Another function, `np.genfromtxt()`, can handle such structures: if you pass `dtype=None`, it infers the type of each column.

-   Import 'titanic.csv' using the function `np.genfromtxt()`.
-   Print the entire column with the name `Survived`. What are the last 4 values of this column?

In [25]:
data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)

# Last 4 values of the Survived column
data['Survived'][-4:]


array([1, 0, 1, 0])
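
A self-contained sketch of the same idea on made-up data. `StringIO` stands in for `titanic.csv`, and `encoding='utf-8'` is an added assumption so that text columns come back as ordinary strings.

```python
from io import StringIO

import numpy as np

# Made-up rows mixing integers, text, and floats
text = "Survived,Sex,Age\n0,male,22.0\n1,female,38.0\n1,female,26.0\n"

# names=True reads the column names from the header row;
# dtype=None lets genfromtxt infer a type for each column
data = np.genfromtxt(StringIO(text), delimiter=",", names=True,
                     dtype=None, encoding="utf-8")

print(data.dtype.names)
print(data["Survived"])
print(data["Sex"])
```

The result is a one-dimensional structured array, so columns are selected by name rather than by position.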

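### `np.recfromcsv()` behaves like `np.genfromtxt()`, except that its default `dtype` is `None`

Note: `np.recfromcsv()` was removed from the main namespace in NumPy 2.0. On recent versions, use `np.genfromtxt(file, delimiter=',', names=True, dtype=None)` instead.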

-   Import titanic.csv using the function np.recfromcsv() and assign it to the variable, d. You'll only need to pass file to it because it has the defaults delimiter=',' and names=True in addition to dtype=None!
-   Run the remaining code to print the first three entries of the resulting array d.

In [None]:
# Assign the filename: file
file = 'titanic.csv'

# Import file using np.recfromcsv: d
d = np.recfromcsv(file)

# Print out first three entries of d
print(d[:3])
