# Chapter 2: UNIX Commands

```{note}
For display purposes, the number of output lines is limited.
```

## 10. Counting lines
Count the number of lines. Use the wc command to verify the result.

**Python:**

In [11]:
with open("popular-names.txt", "r") as f:
    lines = f.readlines()
    cnt = len(lines)
    print(cnt)

2780


**UNIX:**

In [4]:
!wc -l popular-names.txt

2780 popular-names.txt
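
`readlines()` loads the whole file into memory before counting. As a minimal sketch (the helper name `count_lines` and the 64 KiB chunk size are assumptions, and the sample data is just the first rows shown in this chapter), the count can also be obtained by streaming the file in chunks. Like `wc -l`, this counts newline characters, so a final line without a trailing newline is not counted.

```python
import io


def count_lines(stream, chunk_size=64 * 1024):
    """Count newline bytes in a binary stream, reading fixed-size chunks."""
    total = 0
    # iter(callable, sentinel) keeps calling read() until it returns b"".
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        total += chunk.count(b"\n")
    return total


# Small in-memory sample instead of the real file.
sample = b"Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n"
print(count_lines(io.BytesIO(sample)))  # 3
```

For the real data this would be called as `count_lines(open("popular-names.txt", "rb"))`, ideally inside a `with` block.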


## 11. Replacing tabs with spaces
Replace each tab character with a single space. Use the sed, tr, or expand command to verify the result.

**Python:**

In [None]:
with open("popular-names.txt", "r") as f:
    lines = (line.replace('\t', ' ') for line in f)
    for i, line in enumerate(lines):
        if i < 10:
            print(line, end='') 

Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
Margaret F 1578 1880
Ida F 1472 1880
Alice F 1414 1880
Bertha F 1320 1880
Sarah F 1288 1880


**UNIX:**

`sed` ver.

In [2]:
!sed 's/\t/ /g' popular-names.txt 2>/dev/null | head -n 10

Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
Margaret F 1578 1880
Ida F 1472 1880
Alice F 1414 1880
Bertha F 1320 1880
Sarah F 1288 1880



`tr` ver.

In [3]:
!tr '\t' ' ' < popular-names.txt 2>/dev/null | head -n 10

Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
Margaret F 1578 1880
Ida F 1472 1880
Alice F 1414 1880
Bertha F 1320 1880
Sarah F 1288 1880


`expand` ver.

In [4]:
!expand -t 1 popular-names.txt 2>/dev/null | head -n 10

Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
Margaret F 1578 1880
Ida F 1472 1880
Alice F 1414 1880
Bertha F 1320 1880
Sarah F 1288 1880
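
Python also has a direct counterpart to `expand -t 1`. The sketch below (the helper name `tabs_to_spaces` is hypothetical) uses `str.expandtabs(1)`: with a tab size of 1 every tab stop is one column wide, so each tab becomes exactly one space and the result should match `line.replace('\t', ' ')`.

```python
def tabs_to_spaces(line):
    """Replace every tab with a single space via a tab size of 1."""
    return line.expandtabs(1)


sample = "Mary\tF\t7065\t1880\n"
print(tabs_to_spaces(sample), end="")  # Mary F 7065 1880
```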


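## 12. Saving column 1 to col1.txt and column 2 to col2.txt
Extract only the first column of each line and save it as col1.txt, and extract only the second column and save it as col2.txt. Use the cut command to verify the result.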

**Python:**

In [None]:
with open("popular-names.txt", "r") as f, open("col1.txt", "w") as fc1, open("col2.txt", "w") as fc2:
    for line in f:
        columns = line.split('\t')
        fc1.write(columns[0] + '\n')
        fc2.write(columns[1] + '\n') 

**UNIX:**

In [25]:
!cut -f 1 popular-names.txt | awk 'BEGIN {print "[Check]: col1.txt\nUNIX Python\n"} {getline second < "col1.txt"; print $0, second}' 2>/dev/null | head -n 10

[Check]: col1.txt
UNIX Python

Mary Mary
Anna Anna
Emma Emma
Elizabeth Elizabeth
Minnie Minnie
Margaret Margaret
Ida Ida


In [24]:
!cut -f 2 popular-names.txt | awk 'BEGIN {print "[Check]: col2.txt\nUNIX Python\n"} {getline second < "col2.txt"; print $0, second}' 2>/dev/null | head -n 10

[Check]: col2.txt
UNIX Python

F F
F F
F F
F F
F F
F F
F F


In [140]:
!cut -f 1 popular-names.txt > tmp.txt && diff col1.txt tmp.txt && rm tmp.txt
!cut -f 2 popular-names.txt > tmp.txt && diff col2.txt tmp.txt && rm tmp.txt
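
The `diff` cells above print nothing when the files agree. The same comparison can be sketched in Python. The helper name `column_matches` is hypothetical, and the rows below are taken from the sample output in this chapter rather than read from disk.

```python
def column_matches(source_lines, column_lines, index):
    """Return True if column_lines equals field `index` of every source line."""
    expected = [line.rstrip("\n").split("\t")[index] for line in source_lines]
    actual = [line.rstrip("\n") for line in column_lines]
    return expected == actual


source = ["Mary\tF\t7065\t1880\n", "Anna\tF\t2604\t1880\n"]
print(column_matches(source, ["Mary\n", "Anna\n"], 0))  # True
print(column_matches(source, ["F\n", "F\n"], 1))        # True
```

With the real files, the arguments would be the open file objects for `popular-names.txt` and `col1.txt` (or `col2.txt`).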

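## 13. Merging col1.txt and col2.txt
Combine col1.txt and col2.txt created in problem 12 to produce a text file in which the first and second columns of the original file are placed side by side, separated by a tab. Use the paste command to verify the result.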

**Python:**

In [23]:
with open("col1.txt", "r") as fc1, open("col2.txt", "r") as fc2, open("col1-2.txt", "w") as fc12:
    for line1, line2 in zip(fc1, fc2):
        fc12.write(line1.strip() + '\t' + line2.strip() + '\n')

**UNIX:**

In [138]:
!paste col1.txt col2.txt | awk 'BEGIN {print "[Check]: col1-2.txt\nUNIX Python\n"} {getline second < "col1-2.txt"; print $0, second}' | head -n 10

[Check]: col1-2.txt
UNIX Python

Mary	F Mary	F
Anna	F Anna	F
Emma	F Emma	F
Elizabeth	F Elizabeth	F
Minnie	F Minnie	F
Margaret	F Margaret	F
Ida	F Ida	F


In [135]:
!paste col1.txt col2.txt > tmp.txt && diff col1-2.txt tmp.txt && rm tmp.txt
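
One difference worth noting: `zip` stops at the shorter input, whereas `paste` keeps going and treats the exhausted file as empty lines. Here both files have the same length, so the results agree. The sketch below (the function name `paste` and the sample rows are illustrative assumptions) uses `itertools.zip_longest` to mimic that padding behavior.

```python
from itertools import zip_longest


def paste(lines_a, lines_b, delimiter="\t"):
    """Join corresponding lines with a delimiter, padding the shorter input."""
    merged = []
    for a, b in zip_longest(lines_a, lines_b, fillvalue=""):
        merged.append(a.rstrip("\n") + delimiter + b.rstrip("\n") + "\n")
    return merged


# The second input is one line short on purpose.
print(paste(["Mary\n", "Anna\n", "Emma\n"], ["F\n", "F\n"]))
# ['Mary\tF\n', 'Anna\tF\n', 'Emma\t\n']
```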

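## 14. Printing the first N lines
Receive a natural number N, for example as a command-line argument, and print only the first N lines of the input. Use the head command to verify the result.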

**Python:**

In [42]:
def nhlines(N, file):
    with open(file, "r") as f:
        lines = f.readlines()
        for line in lines[:N]:
            print(line.strip())

nhlines(4, "popular-names.txt")


Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880


**UNIX:**

In [38]:
!head -n 4 popular-names.txt

Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880
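
The function above receives N as an ordinary argument. As a hedged sketch of the command-line variant mentioned in the task, the code below parses N with `argparse` and reads lazily with `itertools.islice`, so only the first N lines are ever read. The names `head` and `main` and the argument layout are assumptions for illustration.

```python
import argparse
import io
from itertools import islice


def head(stream, n):
    """Return the first n lines of a stream without reading the rest."""
    return list(islice(stream, n))


def main(argv=None):
    """Hypothetical CLI entry point: main(["4", "popular-names.txt"])."""
    parser = argparse.ArgumentParser(description="Print the first N lines.")
    parser.add_argument("n", type=int, help="number of lines to print")
    parser.add_argument("path", help="input file")
    args = parser.parse_args(argv)
    with open(args.path, "r") as f:
        for line in head(f, args.n):
            print(line, end="")


sample = io.StringIO("Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n")
print(head(sample, 2))  # ['Mary\tF\t7065\t1880\n', 'Anna\tF\t2604\t1880\n']
```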


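## 15. Printing the last N lines
Receive a natural number N, for example as a command-line argument, and print only the last N lines of the input. Use the tail command to verify the result.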

**Python:**

In [43]:
def ntlines(N, file):
    with open(file, "r") as f:
        lines = f.readlines()
        for line in lines[-N:]:
            print(line.strip())

ntlines(4, "popular-names.txt")


Elijah	M	12886	2018
Lucas	M	12585	2018
Mason	M	12435	2018
Logan	M	12352	2018


**UNIX:**

In [44]:
!tail -n 4 popular-names.txt

Elijah	M	12886	2018
Lucas	M	12585	2018
Mason	M	12435	2018
Logan	M	12352	2018
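
A bounded `collections.deque` gives the same result without keeping the whole file in memory: it holds at most N lines and discards older ones as it reads. The sketch below uses a hypothetical helper named `tail`. Unlike the slice `lines[-N:]`, it returns nothing for N = 0 instead of every line.

```python
from collections import deque
import io


def tail(stream, n):
    """Return the last n lines of a stream, keeping at most n in memory."""
    return list(deque(stream, maxlen=n))


sample = io.StringIO("Elijah\nLucas\nMason\nLogan\n")
print(tail(sample, 2))  # ['Mason\n', 'Logan\n']
```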


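## 16. Splitting a file into N parts
Receive a natural number N, for example as a command-line argument, and split the input file into N parts by line. Achieve the same result with the split command.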

**Python:**

In [112]:
def nsplit(N, input_file):
    with open(input_file, "r") as f:
        lines = f.readlines()
    
    total_lines = len(lines)
    split_size = total_lines // N
    rem = total_lines % N
    
    start = 0
    base, ext = input_file.rsplit('.', 1)
    
    for i in range(N):
        end = start + split_size + (1 if i < rem else 0)
        output_file = f"{base}-{i:02}.{ext}"
        with open(output_file, "w") as out_file:
            out_file.write("".join(lines[start:end]))
        start = end

nsplit(3, "popular-names.txt")

**UNIX:**

`nsplit.sh`:

```bash
#!/bin/bash

N=$1
input_file=$2
output_file_prefix=$(basename "$input_file" .txt)"-split-"

total_lines=$(wc -l < "$input_file")

split_size=$((total_lines / N))
rem=$((total_lines % N))

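if [ "$rem" -eq 0 ]; then
    # Evenly divisible: every file gets exactly split_size lines.
    split -l "$split_size" -d --additional-suffix=.txt "$input_file" "$output_file_prefix"
else
    # Not evenly divisible: round the chunk size up (may yield fewer than N files for some inputs).
    split -l "$((split_size + 1))" -d --additional-suffix=.txt "$input_file" "$output_file_prefix"
fi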
```
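
To see when `split -l` with a rounded-up chunk size agrees with the Python version, it helps to compare the chunk sizes each approach produces. The sketch below only does the arithmetic. The helper names `even_sizes` and `split_l_sizes` are assumptions, and the 2780-line total comes from problem 10.

```python
def even_sizes(total_lines, n):
    """Chunk sizes used by the Python version: the first `rem` chunks get one extra line."""
    size, rem = divmod(total_lines, n)
    return [size + 1 if i < rem else size for i in range(n)]


def split_l_sizes(total_lines, lines_per_file):
    """Chunk sizes produced by `split -l`: full chunks, then the remainder."""
    full, rem = divmod(total_lines, lines_per_file)
    return [lines_per_file] * full + ([rem] if rem else [])


print(even_sizes(2780, 3))       # [927, 927, 926]
print(split_l_sizes(2780, 927))  # [927, 927, 926]

# The two strategies can disagree for other inputs:
print(even_sizes(9, 4))          # [3, 2, 2, 2]
print(split_l_sizes(9, 3))       # [3, 3, 3]
```

For this dataset both strategies give the same three files, which is why the `diff` checks in this section pass. That agreement is not guaranteed in general.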

In [127]:
!chmod +x nsplit.sh
!./nsplit.sh 3 popular-names.txt

In [117]:
!awk 'BEGIN {print "[Check]:\npopular-names-00.txt popular-names-split-00.txt\n"} {getline second < "popular-names-split-00.txt"; print $0, second}' popular-names-00.txt 2>/dev/null | head -n 10

[Check]:
popular-names-00.txt popular-names-split-00.txt

Mary	F	7065	1880 Mary	F	7065	1880
Anna	F	2604	1880 Anna	F	2604	1880
Emma	F	2003	1880 Emma	F	2003	1880
Elizabeth	F	1939	1880 Elizabeth	F	1939	1880
Minnie	F	1746	1880 Minnie	F	1746	1880
Margaret	F	1578	1880 Margaret	F	1578	1880
Ida	F	1472	1880 Ida	F	1472	1880


In [118]:
!awk 'BEGIN {print "[Check]:\npopular-names-01.txt popular-names-split-01.txt\n"} {getline second < "popular-names-split-01.txt"; print $0, second}' popular-names-01.txt 2>/dev/null | head -n 10

[Check]:
popular-names-01.txt popular-names-split-01.txt

Virginia	F	16162	1926 Virginia	F	16162	1926
Mildred	F	13551	1926 Mildred	F	13551	1926
Frances	F	13355	1926 Frances	F	13355	1926
Robert	M	61130	1926 Robert	M	61130	1926
John	M	56110	1926 John	M	56110	1926
James	M	53209	1926 James	M	53209	1926
William	M	51920	1926 William	M	51920	1926


In [120]:
!awk 'BEGIN {print "[Check]:\npopular-names-02.txt popular-names-split-02.txt\n"} {getline second < "popular-names-split-02.txt"; print $0, second}' popular-names-02.txt 2>/dev/null | head -n 10

[Check]:
popular-names-02.txt popular-names-split-02.txt

John	M	43181	1972 John	M	43181	1972
Robert	M	43037	1972 Robert	M	43037	1972
Jason	M	37446	1972 Jason	M	37446	1972
Brian	M	36322	1972 Brian	M	36322	1972
William	M	30529	1972 William	M	30529	1972
Matthew	M	22943	1972 Matthew	M	22943	1972
Jennifer	F	62447	1973 Jennifer	F	62447	1973


In [130]:
!diff popular-names-00.txt popular-names-split-00.txt
!diff popular-names-01.txt popular-names-split-01.txt
!diff popular-names-02.txt popular-names-split-02.txt

```{note}
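When splitting a file into $N$ parts with `split`, passing the number of parts as `-n N` may not give the expected result, because `-n N` splits by byte count and can cut a line in the middle. To keep lines intact, use `-n l/N` or `-n r/N`: `l/N` splits into $N$ files by size without splitting lines, and `r/N` distributes lines across $N$ files in round-robin fashion.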

**References:**
- [https://www.man7.org/linux/man-pages/man1/split.1.html](https://www.man7.org/linux/man-pages/man1/split.1.html)
- [https://github.com/coreutils/coreutils/blob/master/src/split.c](https://github.com/coreutils/coreutils/blob/master/src/split.c)

```
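
To make the round-robin idea concrete, the sketch below contrasts it with a contiguous split on a handful of names from the dataset. It only illustrates the two distribution strategies and is not a reimplementation of `split`. Both function names are hypothetical.

```python
def split_round_robin(lines, n):
    """Deal lines out in turn: line i goes to chunk i % n."""
    return [lines[i::n] for i in range(n)]


def split_contiguous(lines, n):
    """Cut lines into n consecutive chunks whose sizes differ by at most one."""
    size, rem = divmod(len(lines), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


lines = ["Mary", "Anna", "Emma", "Elizabeth", "Minnie"]
print(split_round_robin(lines, 2))  # [['Mary', 'Emma', 'Minnie'], ['Anna', 'Elizabeth']]
print(split_contiguous(lines, 2))   # [['Mary', 'Anna', 'Emma'], ['Elizabeth', 'Minnie']]
```

Round-robin keeps every line intact and balances line counts, but the original line order is no longer preserved within a single output file.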

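## 17. Distinct strings in the first column
Find the distinct strings (the set of different strings) in the first column. Use the cut, sort, and uniq commands to verify the result.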

**Python:**

In [143]:
unique_vals = set()
with open("popular-names.txt", "r") as f:
    for line in f:
        unique_vals.add(line.split()[0])

print("\n".join(sorted(unique_vals)[:10]))

Abigail
Aiden
Alexander
Alexis
Alice
Amanda
Amelia
Amy
Andrew
Angela


**UNIX:**

In [10]:
!cut -f 1 popular-names.txt | sort | uniq 2>/dev/null | head -n 10

Abigail
Aiden
Alexander
Alexis
Alice
Amanda
Amelia
Amy
Andrew
Angela
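
The `sort` step in the pipeline is not optional: `uniq` only collapses duplicates that sit on adjacent lines. `itertools.groupby` behaves the same way, which makes the effect easy to see in a small sketch. The helper name `uniq` and the sample list are illustrative.

```python
from itertools import groupby


def uniq(items):
    """Collapse runs of adjacent equal items, like `uniq` does for lines."""
    return [key for key, _ in groupby(items)]


names = ["Mary", "Anna", "Mary", "Mary", "Anna"]
print(uniq(names))          # ['Mary', 'Anna', 'Mary', 'Anna']
print(uniq(sorted(names)))  # ['Anna', 'Mary']
```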


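## 18. Sorting lines in descending order of the third column
Sort the lines in descending order of the numeric value in the third column (note: reorder the lines without changing their content). Use the sort command to verify the result (for this problem, the output does not have to match the command's output).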

**Python:**

In [147]:
with open("popular-names.txt", "r") as f:
    lines = f.readlines()

print("".join(sorted(lines, key=lambda x: float(x.split()[2]), reverse=True)[:10]))

Linda	F	99689	1947
Linda	F	96211	1948
James	M	94757	1947
Michael	M	92704	1957
Robert	M	91640	1947
Linda	F	91016	1949
Michael	M	90656	1956
Michael	M	90517	1958
James	M	88584	1948
Michael	M	88528	1954



**UNIX:**

In [7]:
!sort -k 3,3 -nr popular-names.txt 2>/dev/null | head -n 10

Linda	F	99689	1947
Linda	F	96211	1948
James	M	94757	1947
Michael	M	92704	1957
Robert	M	91640	1947
Linda	F	91016	1949
Michael	M	90656	1956
Michael	M	90517	1958
James	M	88584	1948
Michael	M	88528	1954
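
The task allows the two outputs to differ because ties can be ordered differently. Python's `sorted` is stable, even with `reverse=True`, so rows with equal counts keep their original order. GNU `sort` without `-s` falls back to comparing whole lines when keys are equal. A minimal sketch of the Python side, with a hypothetical helper name and made-up tie rows for illustration:

```python
def sort_by_count_desc(lines):
    """Sort tab-separated lines by the integer in the third field, largest first."""
    return sorted(lines, key=lambda line: int(line.split("\t")[2]), reverse=True)


# The two rows with count 5 are invented to show tie handling.
rows = ["a\tF\t5\t1880\n", "b\tF\t5\t1881\n", "c\tF\t9\t1882\n"]
print("".join(sort_by_count_desc(rows)), end="")
# c	F	9	1882
# a	F	5	1880
# b	F	5	1881
```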


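## 19. Counting the frequency of first-column strings and sorting by descending frequency
Find the frequency of each string in the first column and display the strings in descending order of frequency. Use the cut, uniq, and sort commands to verify the result.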

**Python:**

In [152]:
from collections import Counter
from itertools import islice

with open("popular-names.txt", "r") as f:
    lines = f.readlines()
    
col1_vals = [line.split()[0] for line in lines]
freq = Counter(col1_vals)

for word, count in islice(freq.most_common(), 10):
    print(f"{count:>5} {word}")

  118 James
  111 William
  108 John
  108 Robert
   92 Mary
   75 Charles
   74 Michael
   73 Elizabeth
   70 Joseph
   60 Margaret


**UNIX:**

In [8]:
!cut -f 1 popular-names.txt | sort | uniq -c | sort -nr 2>/dev/null | head -n 10

    118 James
    111 William
    108 Robert
    108 John
     92 Mary
     75 Charles
     74 Michael
     73 Elizabeth
     70 Joseph
     60 Margaret
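
The two outputs differ only in the order of John and Robert, which both occur 108 times. `Counter.most_common()` keeps tied items in first-seen order, while the reversed `sort` puts the alphabetically later name first. As a sketch, assuming plain byte-wise string comparison, the UNIX ordering can be reproduced by sorting on the pair (count, name) in reverse. The helper name `ranked` and the short sample list are illustrative.

```python
from collections import Counter


def ranked(names):
    """Sort (name, count) pairs by count, then name, both descending."""
    freq = Counter(names)
    return sorted(freq.items(), key=lambda item: (item[1], item[0]), reverse=True)


names = ["John", "Robert", "Mary", "John", "Robert"]
for name, count in ranked(names):
    print(f"{count:>7} {name}")
#       2 Robert
#       2 John
#       1 Mary
```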


```{tip}
- [wc](https://man7.org/linux/man-pages/man1/wc.1.html)
- [sed](https://man7.org/linux/man-pages/man1/sed.1.html)
- [tr](https://man7.org/linux/man-pages/man1/tr.1.html)
- [expand](https://man7.org/linux/man-pages/man1/expand.1.html)
- [cut](https://www.man7.org/linux/man-pages/man1/cut.1.html)
- [paste](https://man7.org/linux/man-pages/man1/paste.1.html)
- [head](https://man7.org/linux/man-pages/man1/head.1.html)
- [tail](https://man7.org/linux/man-pages/man1/tail.1.html)
- [split](https://man7.org/linux/man-pages/man1/split.1.html)
- [sort](https://www.man7.org/linux/man-pages/man1/sort.1.html)
- [uniq](https://man7.org/linux/man-pages/man1/uniq.1.html)
```