#### Importing and exporting data in Python

##### Importing data

Dataset URL: https://archive.ics.uci.edu/dataset/10/automobile

In [None]:
import pandas as pd
import numpy as np

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
# Since this dataset doesn't have a header row, you might need to specify column names.
# Refer to the dataset documentation for appropriate column names.
df = pd.read_csv(url, header=None)

# d1 = pd.read_csv(<CSV_path>, header=0)  # load using the first row as the header

# print(df.head())

print(df.tail())

headers = [
    "symboling", "normalized-losses", "make", "fuel-type", "aspiration", 
    "num-of-doors", "body-style", "drive-wheels", "engine-location", 
    "wheel-base", "length", "width", "height", "curb-weight", 
    "engine-type", "num-of-cylinders", "engine-size", "fuel-system", 
    "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm", 
    "city-mpg", "highway-mpg", "price"
]

df.columns = headers

path = "automobile.csv"

# Export the DataFrame to CSV; index=False keeps the row index out of the file.
df.to_csv(path, index=False)
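
As the comments in the cell above note, the file has no header row. The following is a minimal sketch of an alternative way to load it, assuming a small inline sample in place of the real file (the `sample` text and the three-column `cols` list are made up for illustration). `read_csv` can take the column names through `names` and treat `?` as missing through `na_values`, and `to_csv(index=False)` writes the frame back out without the row index.

```python
import io
import pandas as pd

# Hypothetical three-column sample standing in for the real file.
sample = "3,?,alfa-romero\n2,164,audi\n1,158,audi\n"
cols = ["symboling", "normalized-losses", "make"]

# header=None: the data has no header row.
# names: column labels to assign.
# na_values: strings to read as NaN.
demo = pd.read_csv(io.StringIO(sample), header=None, names=cols, na_values="?")
print(demo)

# Export to an in-memory buffer; index=False leaves out the row index.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

With this approach the `?` entries arrive as `NaN` directly, so a separate replacement step may not be needed.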

Check the values in the `normalized-losses` column:

In [12]:
df["normalized-losses"]
df["normalized-losses"].describe()

count     205
unique     52
top         ?
freq       41
Name: normalized-losses, dtype: object

The `replace()` method replaces specific values in a pandas DataFrame or Series with new values.

To turn the `?` placeholder into `NaN`, pass `'?'` as the value to replace and `np.nan` as its replacement.

In [13]:
# In-place alternative: df.replace('?', np.nan, inplace=True)

# Return a new DataFrame with every '?' replaced by NaN.
df1 = df.replace('?', np.nan)

df1["normalized-losses"].describe()

count     164
unique     51
top       161
freq       11
Name: normalized-losses, dtype: object
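
The drop in `count` from 205 to 164 matches the 41 occurrences of `?` reported earlier: `describe()` counts only non-missing entries. A minimal sketch on a toy Series (the values are made up for illustration) shows the same effect:

```python
import numpy as np
import pandas as pd

# Made-up values; "?" marks a missing entry, as in the dataset.
s = pd.Series(["164", "?", "158", "?", "192"])

cleaned = s.replace("?", np.nan)

print(s.describe()["count"])        # counts every entry, including "?"
print(cleaned.describe()["count"])  # counts only the non-missing entries
print(cleaned.isna().sum())         # number of entries that were "?"
```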

In pandas, `dropna()` removes missing values (`NaN`) from a DataFrame or Series. Depending on its arguments, it drops the rows or the columns that contain missing values:

###### Syntax:
```python
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
```

###### Parameters:
- **`axis`**:  
  - `0` (default): Drop rows with missing values.
  - `1`: Drop columns with missing values.
  
- **`how`**:  
  - `'any'` (default): Drops rows or columns where **any** value is missing.
  - `'all'`: Drops rows or columns only if **all** values are missing.

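- **`thresh`**:  
  - Specifies the minimum number of non-`NaN` values required to keep a row or column. In recent pandas versions it cannot be combined with `how`.

- **`subset`**:  
  - Labels along the other axis to check for missing values: column names when dropping rows (`axis=0`), row labels when dropping columns (`axis=1`).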

- **`inplace`**:  
  - `False` (default): Returns a new DataFrame/Series with `NaN` values removed.
  - `True`: Drops `NaN` values in the original DataFrame/Series.
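
The parameters above can be compared side by side in a minimal sketch. The `toy` frame and its values are made up for illustration; only the index labels of the surviving rows are printed.

```python
import numpy as np
import pandas as pd

# Made-up frame: row 1 lacks a price, row 2 is missing every value.
toy = pd.DataFrame({
    "make": ["audi", "bmw", np.nan, "toyota"],
    "horsepower": [102.0, 101.0, np.nan, 62.0],
    "price": [13950.0, np.nan, np.nan, 5348.0],
})

any_missing = toy.dropna()                  # how="any": drop rows with at least one NaN
all_missing = toy.dropna(how="all")         # drop only rows where every value is NaN
at_least_two = toy.dropna(thresh=2)         # keep rows with two or more non-NaN values
price_known = toy.dropna(subset=["price"])  # look at the "price" column only
# After removing the all-NaN row, drop the columns that still contain a NaN.
no_nan_cols = toy.dropna(how="all").dropna(axis=1)

for name, result in [
    ("any", any_missing),
    ("all", all_missing),
    ("thresh=2", at_least_two),
    ("subset=price", price_known),
]:
    print(name, list(result.index))
print("axis=1", list(no_nan_cols.columns))
```

None of these calls modifies `toy`, because `inplace` is left at its default of `False`.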
 

In [17]:
# Drop the rows whose price is missing; the other columns are not checked.
df = df1.dropna(subset=["price"], axis=0)
# df["price"].describe()
df.head(20)

| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | NaN | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164.0 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164.0 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
| 5 | 2 | NaN | audi | gas | std | two | sedan | fwd | front | 99.8 | ... | 136 | mpfi | 3.19 | 3.4 | 8.5 | 110 | 5500 | 19 | 25 | 15250 |
| 6 | 1 | 158.0 | audi | gas | std | four | sedan | fwd | front | 105.8 | ... | 136 | mpfi | 3.19 | 3.4 | 8.5 | 110 | 5500 | 19 | 25 | 17710 |
| 7 | 1 | NaN | audi | gas | std | four | wagon | fwd | front | 105.8 | ... | 136 | mpfi | 3.19 | 3.4 | 8.5 | 110 | 5500 | 19 | 25 | 18920 |
| 8 | 1 | 158.0 | audi | gas | turbo | four | sedan | fwd | front | 105.8 | ... | 131 | mpfi | 3.13 | 3.4 | 8.3 | 140 | 5500 | 17 | 20 | 23875 |
| 10 | 2 | 192.0 | bmw | gas | std | two | sedan | rwd | front | 101.2 | ... | 108 | mpfi | 3.5 | 2.8 | 8.8 | 101 | 5800 | 23 | 29 | 16430 |


To list the column names of a DataFrame, use its `columns` attribute.

In [15]:
df.columns

Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price'],
      dtype='object')
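
As a small illustration on a made-up frame (the `toy` data is hypothetical), the `Index` returned by `columns` can be converted to a plain list or tested for membership:

```python
import pandas as pd

# Made-up two-column frame.
toy = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})

names = toy.columns.tolist()        # plain Python list of column labels
has_price = "price" in toy.columns  # membership test on the Index
print(names, has_price)
```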

In [19]:
df[["normalized-losses","make","price"]]
df[["normalized-losses","make","price"]].describe()

| | normalized-losses | make | price |
|---|---|---|---|
| count | 164 | 201 | 201 |
| unique | 51 | 22 | 186 |
| top | 161 | toyota | 8921 |
| freq | 11 | 32 | 2 |
