## Data preparation
- https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows/

In [34]:
import pandas as pd

In [35]:
cols = ['Series_Title', 'Released_Year', 'Meta_score', 'IMDB_Rating', 'Overview']

In [36]:
# Specify the columns to load
mdf = pd.read_csv('imdb_top_1000.csv', usecols=cols)
mdf

Unnamed: 0,Series_Title,Released_Year,IMDB_Rating,Overview,Meta_score
0,The Shawshank Redemption,1994,9.3,Two imprisoned men bond over a number of years...,80.0
1,The Godfather,1972,9.2,An organized crime dynasty's aging patriarch t...,100.0
2,The Dark Knight,2008,9.0,When the menace known as the Joker wreaks havo...,84.0
3,The Godfather: Part II,1974,9.0,The early life and career of Vito Corleone in ...,90.0
4,12 Angry Men,1957,9.0,A jury holdout attempts to prevent a miscarria...,96.0
...,...,...,...,...,...
995,Breakfast at Tiffany's,1961,7.6,A young New York socialite becomes interested ...,76.0
996,Giant,1956,7.6,Sprawling epic covering the life of a Texas ca...,84.0
997,From Here to Eternity,1953,7.6,"In Hawaii in 1941, a private is cruelly punish...",85.0
998,Lifeboat,1944,7.6,Several survivors of a torpedoed merchant ship...,78.0
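The output above lists `IMDB_Rating` before `Meta_score` even though `cols` names them the other way round. The sketch below reproduces this on a small in-memory CSV; the three rows and the `wanted` list are made-up stand-ins, not data from `imdb_top_1000.csv`.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for imdb_top_1000.csv (values are made up).
csv_text = """Series_Title,Released_Year,Runtime,IMDB_Rating,Meta_score
Movie A,1994,142 min,9.3,80
Movie B,1972,175 min,9.2,100
Movie C,2008,152 min,9.0,84
"""

# Requested in a different order than the file stores them.
wanted = ["Meta_score", "Series_Title", "IMDB_Rating"]

df = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

# usecols filters the columns but keeps the file's order, not the list's order.
print(list(df.columns))  # ['Series_Title', 'IMDB_Rating', 'Meta_score']

# Select with the list afterwards if the requested order matters.
df_ordered = df[wanted]
print(list(df_ordered.columns))  # ['Meta_score', 'Series_Title', 'IMDB_Rating']
```

If the column order matters for later `.insert()` calls, selecting with the list after loading is one simple way to pin it down.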


## Selecting columns

In [37]:
# Selecting a single column returns a Series.
mdf['Series_Title']

0      The Shawshank Redemption
1                 The Godfather
2               The Dark Knight
3        The Godfather: Part II
4                  12 Angry Men
                 ...           
995      Breakfast at Tiffany's
996                       Giant
997       From Here to Eternity
998                    Lifeboat
999                The 39 Steps
Name: Series_Title, Length: 1000, dtype: object

In [38]:
# Selecting multiple columns returns a DataFrame.
mdf[['Series_Title','IMDB_Rating']]

Unnamed: 0,Series_Title,IMDB_Rating
0,The Shawshank Redemption,9.3
1,The Godfather,9.2
2,The Dark Knight,9.0
3,The Godfather: Part II,9.0
4,12 Angry Men,9.0
...,...,...
995,Breakfast at Tiffany's,7.6
996,Giant,7.6
997,From Here to Eternity,7.6
998,Lifeboat,7.6
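The return type depends on the type of the key, not on how many names it holds. A minimal sketch on a hypothetical two-row frame (the frame and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical two-row frame standing in for mdf.
df = pd.DataFrame({
    "Series_Title": ["Movie A", "Movie B"],
    "IMDB_Rating": [9.3, 9.2],
})

single = df["Series_Title"]            # string key -> Series
one_col_frame = df[["Series_Title"]]   # list key -> DataFrame, even with one name

print(type(single).__name__)               # Series
print(type(one_col_frame).__name__)        # DataFrame
print(single.shape, one_col_frame.shape)   # (2,) (2, 1)
```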


In [39]:
# Selecting a column that does not exist raises an error.
mdf['AAA']

KeyError: 'AAA'

For safer access, use `.get()`, just as you would with a Python dictionary.

In [40]:
mdf.get(['Series_Title'])


Unnamed: 0,Series_Title
0,The Shawshank Redemption
1,The Godfather
2,The Dark Knight
3,The Godfather: Part II
4,12 Angry Men
...,...
995,Breakfast at Tiffany's
996,Giant
997,From Here to Eternity
998,Lifeboat


In [41]:
mdf.get(['Series_Title','IMDB_Rating'])

Unnamed: 0,Series_Title,IMDB_Rating
0,The Shawshank Redemption,9.3
1,The Godfather,9.2
2,The Dark Knight,9.0
3,The Godfather: Part II,9.0
4,12 Angry Men,9.0
...,...,...
995,Breakfast at Tiffany's,7.6
996,Giant,7.6
997,From Here to Eternity,7.6
998,Lifeboat,7.6


In [42]:
mdf.get('AAA')  # returns None; no error is raised

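1. Decide whether to handle the `KeyError` yourself,
2. or to look the column up safely with `.get()`.
3. For an existing column, `[ ]` and `.get()` return the same data. The difference is that `[ ]` can also be the target of an assignment (`mdf['col'] = value`), whereas `.get()` is a method call that only returns a value.
4. So `[ ]` can be used to add or overwrite a column, while `.get()` cannot.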
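The trade-off can be sketched on a small frame. The frame, its values, and the fallback string are hypothetical; `default` is the second parameter of `DataFrame.get`.

```python
import pandas as pd

# Hypothetical two-row frame standing in for mdf.
df = pd.DataFrame({
    "Series_Title": ["Movie A", "Movie B"],
    "IMDB_Rating": [9.3, 9.2],
})

# [] raises KeyError for a missing column, so wrap it if the column is optional.
try:
    df["AAA"]
    missing_raised = False
except KeyError:
    missing_raised = True

# .get() returns None (or the supplied default) instead of raising.
nothing = df.get("AAA")
fallback = df.get("AAA", default="no such column")

# For an existing column both forms return the same data.
same = df.get("IMDB_Rating").equals(df["IMDB_Rating"])

# Only [] can appear on the left-hand side of an assignment.
df["IMDB_Rating"] = [9.0, 8.0]

print(missing_raised, nothing, fallback, same)
```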

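## Adding a column - 1. A single repeated value
- Let's create a column that marks which movies I have watched.
- As with a list's `.append()`, you cannot choose the position; the new column is added at the end.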

In [43]:
# Just like adding a value to a dict
mdf['watch']=False


In [44]:
# Check the result
mdf.head()

Unnamed: 0,Series_Title,Released_Year,IMDB_Rating,Overview,Meta_score,watch
0,The Shawshank Redemption,1994,9.3,Two imprisoned men bond over a number of years...,80.0,False
1,The Godfather,1972,9.2,An organized crime dynasty's aging patriarch t...,100.0,False
2,The Dark Knight,2008,9.0,When the menace known as the Joker wreaks havo...,84.0,False
3,The Godfather: Part II,1974,9.0,The early life and career of Vito Corleone in ...,90.0,False
4,12 Angry Men,1957,9.0,A jury holdout attempts to prevent a miscarria...,96.0,False
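As a minimal sketch of what the scalar assignment does, the hypothetical three-row frame below shows that the value is repeated for every row, that a new label lands at the end, and that assigning to an existing label overwrites it in place rather than appending again.

```python
import pandas as pd

# Hypothetical three-row frame standing in for mdf.
df = pd.DataFrame({"Series_Title": ["Movie A", "Movie B", "Movie C"]})

# A scalar on the right-hand side is repeated for every row.
df["watch"] = False
df["note"] = ""

print(df.columns.tolist())   # ['Series_Title', 'watch', 'note']
print(df["watch"].tolist())  # [False, False, False]
print(df["watch"].dtype)     # bool

# Assigning to an existing label overwrites it and keeps its position.
df["watch"] = True
print(df.columns.tolist())   # ['Series_Title', 'watch', 'note']
```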


## Adding a column - 2. A single repeated value with .insert()
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.insert.html
- Suppose we want to add our own rating.
- With `.insert()`, you can also specify the position.

**loc : int**
Insertion index. Must verify 0 <= loc <= len(columns).

**column :str, number, or hashable object**
Label of the inserted column.

**value : Scalar, Series, or array-like**

**allow_duplicates : bool, optional, default lib.no_default**

- Scalar: a single value

In [45]:
# Insert it before watch -> position after insertion -> 0 1 2 3 4 [5] 6
mdf.insert(loc=5, column='My_score', value=None)

In [46]:
# Inspect
mdf.head()

Unnamed: 0,Series_Title,Released_Year,IMDB_Rating,Overview,Meta_score,My_score,watch
0,The Shawshank Redemption,1994,9.3,Two imprisoned men bond over a number of years...,80.0,,False
1,The Godfather,1972,9.2,An organized crime dynasty's aging patriarch t...,100.0,,False
2,The Dark Knight,2008,9.0,When the menace known as the Joker wreaks havo...,84.0,,False
3,The Godfather: Part II,1974,9.0,The early life and career of Vito Corleone in ...,90.0,,False
4,12 Angry Men,1957,9.0,A jury holdout attempts to prevent a miscarria...,96.0,,False
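The `loc` and `allow_duplicates` parameters quoted above can be tried on a tiny frame. This is only a sketch; the column names `a`, `b`, `c`, `x`, `y` are hypothetical.

```python
import pandas as pd

# Hypothetical frame with three columns.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# loc is the position the new column occupies after insertion.
# insert() modifies df in place and returns None.
df.insert(loc=1, column="x", value=0)
print(df.columns.tolist())  # ['a', 'x', 'b', 'c']

# loc == len(df.columns) is allowed and appends at the end.
df.insert(loc=len(df.columns), column="y", value=0)
print(df.columns.tolist())  # ['a', 'x', 'b', 'c', 'y']

# Re-using an existing label raises ValueError unless allow_duplicates=True.
try:
    df.insert(loc=0, column="x", value=1)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```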


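## Adding a column - 3. Adding varying values (Series or array-like)

- The description of the `value` parameter says it accepts a Series or array-like.
- A Series or array-like can be added as long as the number of values matches the number of rows.

First, let's drop the `My_score` and `watch` columns.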

In [48]:
# Write this out as a review of the last lesson.
mdf.drop(columns=['My_score', 'watch'], inplace=True)

In [50]:
mdf.head()

Unnamed: 0,Series_Title,Released_Year,IMDB_Rating,Overview,Meta_score
0,The Shawshank Redemption,1994,9.3,Two imprisoned men bond over a number of years...,80.0
1,The Godfather,1972,9.2,An organized crime dynasty's aging patriarch t...,100.0
2,The Dark Knight,2008,9.0,When the menace known as the Joker wreaks havo...,84.0
3,The Godfather: Part II,1974,9.0,The early life and career of Vito Corleone in ...,90.0
4,12 Angry Men,1957,9.0,A jury holdout attempts to prevent a miscarria...,96.0
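As a side note to the `inplace=True` call above, here is a small sketch of the non-mutating form of `drop()` on a hypothetical frame. `errors="ignore"` is an existing `drop()` option that skips labels that are not present.

```python
import pandas as pd

# Hypothetical frame with the two helper columns from the earlier cells.
df = pd.DataFrame({
    "title": ["A", "B"],
    "My_score": [1, 2],
    "watch": [False, True],
})

# Without inplace=True, drop() leaves df untouched and returns a new frame.
trimmed = df.drop(columns=["My_score", "watch"])
print(df.columns.tolist())       # ['title', 'My_score', 'watch']
print(trimmed.columns.tolist())  # ['title']

# errors="ignore" skips missing labels instead of raising KeyError,
# which makes the cell safe to re-run.
trimmed_again = trimmed.drop(columns=["My_score"], errors="ignore")
print(trimmed_again.columns.tolist())  # ['title']
```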


### Example 1. Adding watched as a list

In [None]:
# Create an example watched list
watched=[True, False]*500
watched

In [55]:
# Add it using the [] approach we learned first
mdf['watched']=watched
mdf.head()

Unnamed: 0,Series_Title,Released_Year,IMDB_Rating,Overview,Meta_score,watched
0,The Shawshank Redemption,1994,9.3,Two imprisoned men bond over a number of years...,80.0,True
1,The Godfather,1972,9.2,An organized crime dynasty's aging patriarch t...,100.0,False
2,The Dark Knight,2008,9.0,When the menace known as the Joker wreaks havo...,84.0,True
3,The Godfather: Part II,1974,9.0,The early life and career of Vito Corleone in ...,90.0,False
4,12 Angry Men,1957,9.0,A jury holdout attempts to prevent a miscarria...,96.0,True
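A list is only one kind of array-like. As a sketch, a NumPy array of the right length is assigned positionally in the same way; the frame and column names here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame.
df = pd.DataFrame({"title": ["A", "B", "C", "D"]})

# Both values have exactly one element per row and are matched by position.
df["from_list"] = [True, False] * 2      # Python list
df["from_array"] = np.arange(4) * 2.5    # NumPy array

print(df["from_list"].tolist())   # [True, False, True, False]
print(df["from_array"].tolist())  # [0.0, 2.5, 5.0, 7.5]
```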


### Example 2. Adding My_score as a Series

In [56]:
import random
my_score = [random.randint(0, 10) for x in range(1000)]
my_score

[8, 4, 10, 1, 0, 8, 9, 0, 8, 3, ...]

In [57]:
# Create a Series
my_score_series=pd.Series(my_score)
my_score_series.head()

0     8
1     4
2    10
3     1
4     0
dtype: int64

In [58]:
# Insert it before IMDB_Rating. The number of values should match the number of rows.
mdf.insert(loc=2, column='My_score', value=my_score_series)

In [59]:
# Check
mdf.head()

Unnamed: 0,Series_Title,Released_Year,My_score,IMDB_Rating,Overview,Meta_score,watched
0,The Shawshank Redemption,1994,8,9.3,Two imprisoned men bond over a number of years...,80.0,True
1,The Godfather,1972,4,9.2,An organized crime dynasty's aging patriarch t...,100.0,False
2,The Dark Knight,2008,10,9.0,When the menace known as the Joker wreaks havo...,84.0,True
3,The Godfather: Part II,1974,1,9.0,The early life and career of Vito Corleone in ...,90.0,False
4,12 Angry Men,1957,0,9.0,A jury holdout attempts to prevent a miscarria...,96.0,True


## What if the number of values does not match?

### Test 1: Adding a column with [] + `array-like`

In [60]:
watched

[True, False, True, False, True, False, ...]

In [61]:
del watched[0]

In [62]:
len(watched)

999

In [63]:
# Test: 1000 rows <- 999 values: raises an error
mdf['watched_new']=watched

ValueError: Length of values (999) does not match length of index (1000)

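If a length mismatch is possible, the assignment can be guarded. The following is a minimal sketch on a hypothetical three-row frame; it also checks that the failed assignment leaves the frame unchanged.

```python
import pandas as pd

# Hypothetical three-row frame and a list that is one value short.
df = pd.DataFrame({"title": ["A", "B", "C"]})
too_short = [True, False]

try:
    df["watched_new"] = too_short
    error_message = None
except ValueError as exc:
    # pandas rejects a list whose length differs from the row count.
    error_message = str(exc)

print(error_message)

# The column was not created, so the frame is unchanged.
print("watched_new" in df.columns)  # False
```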

### Test 2: Adding a column with insert + `Series`

In [66]:
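# Delete the value at index label 1. (The size of 998 below implies that one value had already been deleted in a cell not shown here.)
del my_score_series[1]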
my_score_series.size

998

In [69]:
# Test: 1000 rows <- 998 values: does it raise an error?
mdf.insert(loc=2, column='my_new_score', value=my_score_series)

In [70]:
mdf.head()

Unnamed: 0,Series_Title,Released_Year,my_new_score,My_score,IMDB_Rating,Overview,Meta_score,watched
0,The Shawshank Redemption,1994,,8,9.3,Two imprisoned men bond over a number of years...,80.0,True
1,The Godfather,1972,,4,9.2,An organized crime dynasty's aging patriarch t...,100.0,False
2,The Dark Knight,2008,10.0,10,9.0,When the menace known as the Joker wreaks havo...,84.0,True
3,The Godfather: Part II,1974,1.0,1,9.0,The early life and career of Vito Corleone in ...,90.0,False
4,12 Angry Men,1957,0.0,0,9.0,A jury holdout attempts to prevent a miscarria...,96.0,True


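1. Values are added where the index labels match.
2. Rows without a matching index label are filled with NaN.

Because NaN is a float, the remaining integer scores are shown as floats (10.0, 1.0, 0.0). During data preprocessing, NaN values can easily appear this way.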
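The alignment behaviour can be reproduced on a small scale. In this sketch the four-row frame and the `scores` Series are hypothetical; two labels are dropped so that the Series is shorter than the frame.

```python
import pandas as pd

# Hypothetical four-row frame with the default index 0..3.
df = pd.DataFrame({"title": ["A", "B", "C", "D"]})

scores = pd.Series([8, 4, 10, 1])
scores = scores.drop(index=[0, 1])  # only labels 2 and 3 remain

# No error: insert() matches the Series to the frame by index label.
df.insert(loc=1, column="aligned", value=scores)

print(df["aligned"].tolist())      # [nan, nan, 10.0, 1.0]
print(df["aligned"].dtype)         # float64
print(df["aligned"].isna().sum())  # 2
```

Counting missing values with `isna().sum()` right after such an insert is a cheap way to notice an unintended mismatch.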

### Test 3. What if we add a `Series` with []?

In [71]:
mdf['new_watch']=my_score_series

In [72]:
mdf.head()

Unnamed: 0,Series_Title,Released_Year,my_new_score,My_score,IMDB_Rating,Overview,Meta_score,watched,new_watch
0,The Shawshank Redemption,1994,,8,9.3,Two imprisoned men bond over a number of years...,80.0,True,
1,The Godfather,1972,,4,9.2,An organized crime dynasty's aging patriarch t...,100.0,False,
2,The Dark Knight,2008,10.0,10,9.0,When the menace known as the Joker wreaks havo...,84.0,True,10.0
3,The Godfather: Part II,1974,1.0,1,9.0,The early life and career of Vito Corleone in ...,90.0,False,1.0
4,12 Angry Men,1957,0.0,0,9.0,A jury holdout attempts to prevent a miscarria...,96.0,True,0.0
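The output shows that `[]` also aligns a Series by index label. The sketch below, on a hypothetical three-row frame, contrasts that with converting the Series to a NumPy array first, which discards the labels and brings back the length check from Test 1.

```python
import pandas as pd

# Hypothetical three-row frame with the default index 0..2.
df = pd.DataFrame({"title": ["A", "B", "C"]})

# Labels are out of order and label 1 is missing.
scores = pd.Series([7, 9], index=[2, 0])

# [] + Series: matched by label, so order and gaps do not raise an error.
df["by_label"] = scores
print(df["by_label"].tolist())  # [9.0, nan, 7.0]

# A NumPy array has no labels, so it is matched by position and must have
# exactly one value per row.
try:
    df["by_position"] = scores.to_numpy()
    positional_ok = True
except ValueError:
    positional_ok = False
print(positional_ok)  # False
```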
